Abstract: We investigate the capability of localizing node
failures in communication networks from binary states (normal/failed)
of end-to-end paths. Given a set of nodes of interest,
uniquely localizing failures within this set requires that different 
observable path states associate with different node failure events. 
However, this condition is difficult to test on large networks 
due to the need to enumerate all possible node failures. Our 
first contribution is a set of sufficient/necessary conditions for 
identifying a bounded number of failures within an arbitrary 
node set that can be tested in polynomial time. In addition to 
network topology and locations of monitors, our conditions also 
incorporate constraints imposed by the probing mechanism used. 
We consider three probing mechanisms that differ according to 
whether measurement paths are (i) arbitrarily controllable, (ii) 
controllable but cycle-free, or (iii) uncontrollable (determined 
by the default routing protocol). Our second contribution is 
to quantify the capability of failure localization through (1) 
the maximum number of failures (anywhere in the network) 
such that failures within a given node set can be uniquely 
localized, and (2) the largest node set within which failures 
can be uniquely localized under a given bound on the total 
number of failures. Both measures in (1-2) can be converted 
into functions of a per-node property, which can be computed 
efficiently based on the above sufficient/necessary conditions. We 
demonstrate how measures (1-2) proposed for quantifying failure 
localization capability can be used to evaluate the impact of 
various parameters, including topology, number of monitors, and 
probing mechanisms. 

Index Terms: Network Tomography, Failure Localization,
Identifiability Condition, Maximum Identifiability Index

I. Introduction 

Effective monitoring of network performance is essential 
for network operators in building reliable communication 
networks that are robust to service disruptions. In order to 
achieve this goal, the monitoring infrastructure must be able to 
detect network misbehaviors (e.g., unusually high loss/latency, 
unreachability) and localize the sources of the anomaly (e.g., 
malfunction of certain routers) in an accurate and timely manner.
Knowledge of where problematic network elements reside
in the network is particularly useful for fast service recovery,
e.g., the network operator can migrate affected services and/or
reroute traffic. However, localizing network elements that
cause a service disruption can be challenging. The straightforward
approach of directly monitoring the health of individual
elements is not always feasible due to traffic overhead, access
control, or lack of protocol support at internal nodes. Moreover,
built-in monitoring agents running on network elements
cannot detect problems caused by misconfigured/unanticipated
interactions between network layers, where end-to-end communication
is disrupted but individual network elements along
the path remain functional (a.k.a. silent failures) [?]. These
limitations call for a different approach that can diagnose
the health of network elements from the health of end-to-end
communications perceived between measurement points.

One such approach, generally known as network tomography [?],
focuses on inferring internal network characteristics
based on end-to-end performance measurements from
a subset of nodes with monitoring capabilities, referred to
as monitors. Unlike direct measurement, network tomography
relies only on end-to-end performance (e.g., path connectivity)
experienced by data packets, thus addressing issues such as
overhead, lack of protocol support, and silent failures. In cases
where the network characteristic of interest is binary (e.g.,
normal or failed), this approach is known as Boolean network
tomography [?].

In this paper, we study an application of Boolean network 
tomography to localize node failures from measurements of
path states.* Under the assumption that a measurement path is
normal if and only if all nodes on this path behave normally, 
we formulate the problem as a system of Boolean equations, 
where the unknown variables are the binary node states, and 
the known constants are the observed states of measurement 
paths. The goal of Boolean network tomography is essentially 
to solve this system of Boolean equations. 

* This model can also capture link failures by transforming the topology into
a logical topology with each link represented by a virtual node connected to
the nodes incident to the link.

Because the observations are coarse-grained (path
normal/failed), it is usually impossible to uniquely identify
node states from path measurements. For example, if two
nodes always appear together in measurement paths, then
upon observing failures of all these paths, we can at most
deduce that one of these nodes (or both) has failed but
cannot determine which one. Because there are often multiple
explanations for given path failures, existing work mostly
focuses on finding the minimum set of nodes whose failure
explains the observed path failures, i.e., the most probable
failure set. Such an approach, however,
does not guarantee that nodes in this minimum set have 
failed or that nodes outside the set have not. Generally, to 
distinguish between two possible failure sets, there must exist 
a measurement path that traverses one and only one of these 
two sets. There is, however, a lack of understanding of what 
this requires in terms of observable network properties such 
as topology, monitor placement, and measurement routing. 
On the other hand, even if there exists ambiguity in failure 
localization across the entire network, it is still possible to 
uniquely localize node failures in a specific sub-network (e.g., 
sub-network with a large fraction of monitors). To determine 
such unique failure localization in sub-networks, we need to 
understand how it is related to network properties. 

In this paper, we consider three closely related problems.
Let $S$ denote a set of nodes of interest (i.e., there can be ambiguity
in determining the states of nodes outside $S$; however,
the states of nodes in $S$ must be uniquely determinable). (1)
If the number of simultaneous node failures is bounded by $k$,
then under what conditions can one uniquely localize failed
nodes in $S$ from path measurements available in the entire
network? (2) What is the maximum number of simultaneous
node failures (i.e., the largest value of $k$) such that any failures
within $S$ can be uniquely localized? (3) What is the largest
node set within which failures can be uniquely localized, if the
total number of failures is bounded by $k$? Answers to questions
(2) and (3) together quantify a network’s capability to
localize failures from end-to-end measurements: question (2)
characterizes the scale of failures and question (3) the scope of
localization. Clearly, answers to the above questions depend on
which paths are measurable, which in turn depends on network
topology, placement of monitors, and the routing mechanism
of probes. We will study all these problems in the context of
the following classes of probing mechanisms: (i) Controllable
Arbitrary-path Probing (CAP), where any measurement path
can be set up by monitors, (ii) Controllable Simple-path Probing
(CSP), where any measurement path can be set up, provided
it is cycle-free, and (iii) Uncontrollable Probing (UP),
where measurement paths are determined by the default routing
protocol. These probing mechanisms assume different levels
of control over routing of probing packets and are feasible
in different network scenarios (see Section II-C); answers to
the above three problems under these probing mechanisms thus
provide insights on how the level of control bestowed on the
monitoring system affects its capability in failure localization.

A. Related Work 

Existing work can be broadly classified into single failure
localization and multiple failure localization. Single failure
localization assumes that multiple simultaneous failures happen
with negligible probability. Under this assumption, [?], [?]
propose efficient algorithms for monitor placement such that
any single failure can be detected and localized. To improve
the resolution in characterizing failures, range tomography in
[?] not only localizes the failure, but also estimates its severity
(e.g., congestion level). These works, however, ignore the fact
that multiple failures occur more frequently than one may
imagine [?]. In this paper, we consider the general case of
localizing multiple failures.

Multiple failure localization faces inherent uncertainty. Most
existing works address this uncertainty by attempting to find
the minimum set of network elements whose failures explain
the observed path states. Under the assumption that failures are
low-probability events, this approach generates the most probable
failure set among all possibilities. Using this approach,
[?], [?] propose solutions for networks with tree topologies,
which are later extended to general topologies in [?]. Similarly,
[?] proposes to localize link failures by minimizing false positives;
however, it cannot guarantee unique failure localization.
In a Bayesian formulation, [?] proposes a two-stage solution
which first estimates the failure (loss rate above threshold)
probabilities of different links and then infers the most likely
failure set for subsequent measurements. By augmenting path
measurements with (partially) available control plane information
(e.g., routing messages), [?], [?] propose a greedy
heuristic for troubleshooting network unreachability in multi-AS
(Autonomous System) networks that has better accuracy
than benchmarks using only path measurements.

Little is known when we insist on uniquely localizing
network failures. Given a set of monitors known to uniquely
localize failures on paths between themselves, [?] develops an
algorithm to remove redundant monitors such that all failures
remain identifiable. If the number of failed links is upper
bounded by $k$ and the monitors can probe arbitrary cycles
or paths containing cycles, [?] proves that the network must
be $(k+2)$-edge-connected to identify any failures up to $k$ links
using one monitor, which is then used to derive requirements
on monitor placement for general topologies. Solving node
failure localization using the results of [?], however, requires
a topology transformation that maps each node to a link
while maintaining adjacency between nodes and feasibility
of measurement paths. To our knowledge, no such transformation
exists whose output satisfies the assumptions of [?]
(undirected graph, measurement paths not containing repeated
links). Later, [?] proves that under a CAP-like probing mechanism,
the condition can be relaxed to the network being
$k$-edge-connected. Both [?], [?] focus on placing monitors and
constructing measurement paths to localize a given number of
failures; in contrast, we focus on characterizing the capability
of failure localization under a given monitor placement and
constraints on measurement paths. In previous work [?], we
proposed efficient testing conditions and algorithms to quantify
the capability of localizing node failures in the entire network;
however, we did not consider the case that even if some node
states cannot be uniquely determined, we may still be able to
unambiguously determine the states of some other nodes. In
this paper, we thus investigate the relationships between the capability
of localizing node failures and explicit network properties
such as topology, placement of monitors, probing mechanism,
and nodes of interest, with a focus on developing efficient
algorithms to characterize the capability under given settings.

A related but fundamentally different line of work is graph-constrained
group testing [?], which studies the minimum
number of measurement paths needed to uniquely localize a 
given number of (node/link) failures, using a CAP-like probing 
mechanism. In contrast, we seek to characterize the type of 
failures (number and location) that can be uniquely localized 
using a variety of probing mechanisms. 

B. Summary of Contributions 

We study the fundamental capability of a network with 
arbitrarily placed monitors to uniquely localize node failures 
from binary end-to-end measurements between monitors. Our 
contributions are four-fold:

1) We propose two novel measures to quantify the capability 
of failure localization, (i) maximum identifiability index of a 
given node set, which characterizes the maximum number of 
simultaneous failures such that failures within this set can 
be uniquely localized, and (ii) maximum identifiable set for 
a given upper bound on the number of simultaneous failures, 
which represents the largest node set within which failures can 
be uniquely localized if the failure event satisfies the bound. 
We show that both measures can be expressed as functions 
of per-node maximum identifiability index (i.e., maximum 
number of failures such that the failure of a given node can 
be uniquely determined). 

2) We establish necessary/sufficient conditions for uniquely 
localizing failures in a given set under a bound on the total 
number of failures, which are applicable to all probing mechanisms.
We then convert these conditions into more concrete
conditions in terms of network topology and placement of 
monitors, under the three different probing mechanisms (CAP, 
CSP, and UP), which can be tested in polynomial time. 

3) We show that a special relationship between the above 
necessary/sufficient conditions leads to tight upper/lower 
bounds on the maximum identifiability index of a given set that 
narrows its value to at most two consecutive integers. These 
conditions also enable a strategy for constructing inner/outer 
bounds (i.e., subset/superset) of the maximum identifiable set. 
These bounds are polynomial-time computable under CAP 
and CSP. While they are NP-hard to compute under UP, we 
present a greedy heuristic to compute a pair of relaxed bounds 
that frequently coincide with the original bounds in practice. 

4) We evaluate the proposed measures under different probing
mechanisms on random and real topologies. Our evaluation
shows that controllable probing, especially CAP, significantly
improves the capability of node failure localization over uncontrollable
probing. Our result also reveals novel insights into
the distribution of per-node maximum identifiability index and 
its relationship with graph-theoretic node properties. 

Note: Our results are also applicable to transient failures
as long as node failures persist during probing (i.e., leading
to failures of all traversing paths). We have limited our observations
to binary states (normal/failed) of measurement paths.
It is possible in some networks to obtain extra information
from probes, e.g., rerouted paths after a default path fails,
in which case our solution provides lower bounds on the
capability of localizing failures. Furthermore, we do not make
any assumption on the distribution or correlation of node
failures across the network. In some application scenarios
(e.g., datacenter networks), node failures may be correlated
(e.g., all routers sharing the same power/chiller). We leave the
characterization of failure localization in the presence of such
additional information to future work.

TABLE I
Graph-related Notations

Symbol                     Meaning
-------------------------  ----------------------------------------------------------
$V$, $L$                   set of nodes/links (^ := $|L|$)
$M$, $N$                   set of monitors/non-monitors ($M \cup N = V$, $\mu := |M|$, $\sigma := |N|$)
$k$                        maximum number of simultaneous non-monitor failures
$V(\mathcal{G})$           set of nodes in $\mathcal{G}$
$\mathcal{N}(M)$           set of non-monitors that are neighbors of at least one monitor in $M$ (6 := $|\mathcal{N}(M)|$)
$\mathcal{L}(U, W)$        $\mathcal{L}(U, W) = \{\text{link } vw : \forall v \in U, w \in W, v \neq w\}$
$\mathcal{G} - L'$         delete links: $\mathcal{G} - L' = (V, L \setminus L')$, where "$\setminus$" is setminus
$\mathcal{G} + L'$         add links: $\mathcal{G} + L' = (V, L \cup L')$, where the end-points of links in $L'$ must be in $V$
$\mathcal{G} - V'$         delete nodes: $\mathcal{G} - V' = (V \setminus V', L \setminus L(V'))$, where $L(V')$ is the set of links incident to nodes in $V'$
$\mathcal{G} + V'$         add nodes: $\mathcal{G} + V' = (V \cup V', L)$
$\mathcal{G}^*$            auxiliary graph of $\mathcal{G}$ (see Fig. [?])
[?]                        auxiliary graph of $\mathcal{G}$ w.r.t. monitor $m$ (see Fig. [?])
$\mathcal{G}'$             extended graph of $\mathcal{G}$ (see Fig. [?])
$\Omega(S)$, $\Omega(v)$   maximum identifiability index of $S$ or $v$ ($S$: a set of nodes, $v$: a node)
$S^*(k)$                   maximum $k$-identifiable set
$S^{\mathrm{inner}}(k)$    subset of $S^*(k)$
$S^{\mathrm{outer}}(k)$    superset of $S^*(k)$

The rest of the paper is organized as follows. Section II
formulates the problem. Section III presents the theoretical
foundations for identifying node failures, followed by
verifiable identifiability conditions for specific classes of
probing mechanisms in Section IV. Based on the derived
conditions, tight bounds on the maximum identifiability
index are presented in Section V, and inner/outer
bounds on the maximum identifiable set are established
in Section VI. We evaluate the established bounds on various
synthetic/real topologies in Section VII to study the impact
of various parameters (topology, number of monitors, probing
mechanism) on the capability of node failure localization.
Finally, Section VIII concludes the paper.

II. Problem Formulation 
A. Models and Assumptions 

We assume that the network topology is known and model
it as an undirected graph† $\mathcal{G} = (V, L)$, where $V$ and $L$ are the
sets of nodes and links. In $\mathcal{G}$, the number of neighbors of node
$v$ is called the degree of $v$; ^ := $|L|$ denotes the number of
links. Note that graph $\mathcal{G}$ can represent a logical topology where
each node in $\mathcal{G}$ corresponds to a physical subnetwork. Without
loss of generality, we assume $\mathcal{G}$ is connected, as different
connected components have to be monitored separately.

† We use the terms network and graph interchangeably.

A subset of nodes $M$ ($M \subseteq V$) are monitors that can initiate
and collect measurements. The rest of the nodes, denoted
by $N := V \setminus M$, are non-monitors. Let $\mu := |M|$ and
$\sigma := |N|$ denote the numbers of monitors and non-monitors.
We assume that monitors do not fail during the measurement
process, as failed monitors can be directly detected and excluded
(assuming centralized control within the monitoring
system). Non-monitors, on the other hand, can fail, and a
failure event may involve simultaneous failures of multiple
non-monitors. Depending on the adopted probing mechanism,
monitors measure the states of nodes by sending probes along
certain paths. Let $P$ denote the set of all possible measurement
paths; for given $\mathcal{G}$ and $M$, different probing mechanisms can
lead to different sets of measurement paths, which will be
specified later. We use node state (path state) to refer to the
binary state, failed or normal, of a node (path), where a path
fails if and only if at least one node on the path fails. Table I
summarizes graph-related notations used in this paper.

Let $\mathbf{w} = (W_1, \ldots, W_\sigma)^T$ be the binary column vector of the
states of all non-monitors and $\mathbf{c} = (c_1, \ldots, c_\gamma)^T$ the binary
column vector ($\gamma = |P|$) of the states of all measurement
paths. For both node and path states, 0 represents “normal”
and 1 represents “failed”. We relate the path states to the node
states through the following Boolean linear system:

$$\mathbf{R} \odot \mathbf{w} = \mathbf{c}, \tag{1}$$

where $\mathbf{R} = (R_{ij})$ is a $\gamma \times \sigma$ measurement matrix, with each
entry $R_{ij} \in \{0, 1\}$ denoting whether non-monitor $v_j$ is present
on path $\mathcal{P}_i$ (1: yes, 0: no), and “$\odot$” is the Boolean matrix
product, i.e., $c_i = \bigvee_{j=1}^{\sigma} (R_{ij} \wedge W_j)$. The goal of Boolean
network tomography is to invert this Boolean linear system
to solve for all/part of the elements in $\mathbf{w}$ given $\mathbf{R}$ and $\mathbf{c}$.
Intuitively, for a node set $S$ ($S \subseteq N$), any node failures in
$S$ are identifiable if and only if the corresponding states of $S$
in $\mathbf{w}$ are uniquely determinable by (1).
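The Boolean product in (1) can be illustrated with a minimal Python sketch. The function name `boolean_product` and the small matrix below are assumptions made for illustration only; they are not taken from the paper's algorithms.

```python
from typing import List

def boolean_product(R: List[List[int]], w: List[int]) -> List[int]:
    """Return c with c_i = OR over j of (R[i][j] AND w[j])."""
    return [int(any(r_ij and w_j for r_ij, w_j in zip(row, w))) for row in R]

# Hypothetical measurement matrix: 3 paths (rows) over 4 non-monitors (columns).
R = [
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]

w = [0, 1, 0, 0]               # only the second non-monitor has failed
print(boolean_product(R, w))   # [0, 0, 1]

# The third non-monitor lies on no path, so its state never affects c:
# a different node-state vector yields the same path states.
print(boolean_product(R, [0, 1, 1, 0]))   # [0, 0, 1]
```

The second call shows why the system generally cannot be inverted: two distinct node-state vectors map to the same observed path states.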

B. Definitions 

Let failure set $F$ be a set of non-monitors ($F \subseteq N$) that
fail simultaneously. Note that the collection of all failure sets
in a given network covers all possible failure scenarios (each
corresponds to a failure set) that can occur in this network;
the goal of failure localization is to infer the current failure set
from the states of measurement paths. The challenge for this
problem is that there may exist multiple failure sets leading
to the same path states, causing ambiguity. Let $P_F$ denote
the set of all measurement paths affected by a failure set $F$
(i.e., paths traversing at least one node in $F$). To quantify the
capability of uniquely determining the failure set, we introduce
the following definitions.

Definition 1. Given a network $\mathcal{G}$ and a set of measurement
paths $P$, two failure sets $F_1$ and $F_2$ are distinguishable if and
only if $P_{F_1} \neq P_{F_2}$, i.e., there exists a path that traverses one and only
one of $F_1$ and $F_2$.
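As a small illustration of Definition 1, the following Python sketch computes the set of affected paths for a failure set and compares two failure sets. The helper names `affected_paths` and `distinguishable` and the two-path matrix are hypothetical and chosen only for this example.

```python
from typing import FrozenSet, Iterable, List

def affected_paths(R: List[List[int]], F: Iterable[int]) -> FrozenSet[int]:
    """Indices of paths that traverse at least one node in F (the set P_F)."""
    F = list(F)
    return frozenset(i for i, row in enumerate(R) if any(row[j] for j in F))

def distinguishable(R: List[List[int]], F1: Iterable[int], F2: Iterable[int]) -> bool:
    """Definition 1: F1 and F2 are distinguishable iff P_F1 != P_F2."""
    return affected_paths(R, F1) != affected_paths(R, F2)

# Hypothetical matrix in which nodes 0 and 1 always appear together on paths.
R = [
    [1, 1, 0],
    [1, 1, 1],
]

print(distinguishable(R, {0}, {1}))     # False: same paths fail either way
print(distinguishable(R, {0}, {0, 1}))  # False
print(distinguishable(R, {0}, {2}))     # True: path 0 traverses node 0 but not node 2
```

This mirrors the earlier observation that two nodes which always share measurement paths cannot be told apart from path states alone.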

Definition 1 implies that two potential failure sets must be
associated with different observable path states for monitors
to determine which set of nodes have failed. While uniquely
localizing arbitrary failures requires all subsets of $N$ to be
pairwise distinguishable, we can relax this requirement by only
considering failure sets of size bounded by $k$ ($k \ge 1$), which
represents the scale of probable failure events. Moreover, in
practice, we are usually interested in the states of a subset of
nodes $S$ ($S \subseteq N$), in which case the goal is to only ensure
unique failure localization within $S$. Note that failures ($F$)
may occur anywhere in the network ($F \subseteq N$) and are not
restricted to $S$.

Definition 2. Given a network $\mathcal{G}$ (with non-monitor set $N$)
and a node set $S$ of interest ($S \subseteq N$):

1) $S$ is $k$-identifiable if for any two failure sets $F_1$ and $F_2$
satisfying (1) $|F_i| \le k$ ($i = 1, 2$) and (2) $F_1 \cap S \neq F_2 \cap S$,
$F_1$ and $F_2$ are distinguishable.

2) The maximum identifiability index of $S$, denoted by
$\Omega(S)$, is the maximum value of $k$ such that $S$ is
$k$-identifiable.

Intuitively, if a node set $S$ is $k$-identifiable, then the states
(normal/failed) of all nodes within this set are unambiguously
determinable from the observed path states, provided the total
number of failures (anywhere in the network) is bounded by
$k$. The maximum identifiability index $\Omega(S)$ characterizes the
network’s capability to uniquely localize failures in $S$. Definition 2
generalizes the notion of network-wide $k$-identifiability
and maximum identifiability index introduced in [?], where
only the case of $S = N$ was considered. In the special case of
$S = \{v\}$, we say that node $v$ is $k$-identifiable; the maximum
identifiability index of $S = \{v\}$ is denoted by $\Omega(v)$. Note that
any subset of a $k$-identifiable set is also $k$-identifiable. We are
therefore interested in the maximum such set.

Definition 3. Given $k$, the maximum $k$-identifiable set, denoted
by $S^*(k)$, is the largest-cardinality non-monitor set that is
$k$-identifiable.

According to Definition 3, the maximum $k$-identifiable
set is defined based on its cardinality, and thus
may appear not to be unique. Nevertheless, we prove in Section III-B
that $S^*(k)$ is unique. The significance of the maximum
$k$-identifiable set is that it measures the completeness of the
inferred network state: it contains all nodes whose states can
be inferred reliably from the observed path states, as long as
the total number of failures in the network is bounded by $k$.
Note that $k$ is a design parameter capturing the scale of failures
that the system is designed to handle.

C. Classification of Probing Mechanisms 

The above definitions are all stated with respect to (w.r.t.)
a given set of measurement paths $P$. Given the topology $\mathcal{G}$ and
monitor locations $M$, the probing mechanism plays a crucial
role in determining $P$. Depending on the flexibility of probing
and the cost of deployment, we classify probing mechanisms
into one of three classes:

1) Controllable Arbitrary-path Probing (CAP): P includes 
any path/cycle, allowing repeated nodes/links, provided 
each path/cycle starts and ends at monitors. 

2) Controllable Simple-path Probing (CSP): P includes any 
simple path between distinct monitors, not including 
repeated nodes. 

3) Uncontrollable Probing (UP): P is the set of paths 
between monitors determined by the routing protocol 
used by the network, not controllable by the monitors. 

Although CAP allows probes to traverse each node/link an 
arbitrary number of times, it suffices to consider paths where 
each probe traverses each link at most once in either direction 
for the sake of localizing node failures. 

These probing mechanisms clearly provide decreasing
flexibility to the monitors and therefore decreasing capability
to localize failures. However, they also incur decreasing
deployment cost. CAP represents the most flexible probing
mechanism and provides an upper bound on failure localization
capability. In traditional networks, CAP is feasible at the
IP layer if (strict) source routing (an IP option) [?] is enabled
at all nodes‡, or at the application layer (to localize failures in
overlay networks) if equivalent “source routing” is supported
by the application. Moreover, CAP is also feasible under
an emerging networking paradigm called software-defined
networking (SDN) [20], [?], where monitors can instruct
the SDN controller to set up arbitrary paths for the probing
traffic. In particular, an SDN consisting of OpenFlow switches
[?] can set up paths by configuring the flow table of each
traversed OpenFlow switch to forward a probing flow (e.g.,
one TCP connection) to a next hop based on the ingress port
and the flow identifier, which allows the path to have repeated
nodes/links. In contrast, UP represents the most basic probing
mechanism, feasible in any network supporting data forwarding,
that provides a lower bound on the capability of failure
localization. CSP represents an intermediate case that allows
control over routing while respecting a basic requirement that
routes must be cycle-free. CSP is implementable by MPLS
(Multiprotocol Label Switching), where the “explicit routing”
mode [22] allows one to set up a controllable, non-shortest
path using labels so long as the path is cycle-free. Note that
the cycle-free constraint here is crucial, as data forwarding in
MPLS will encounter forwarding loops if a path has cycles.

The significance of these three probing mechanisms is that
they capture the main features of several existing and emerging
routing techniques. Specifically, UP is generally supported
in existing networks without special configuration, CSP is
feasible in some of today’s networks running MPLS with certain
configuration (i.e., label propagation via explicit routing),
while CAP represents the capability of future networks once
SDN is broadly deployed.

‡ Source routing allows nodes to modify the source and the destination
addresses in packet headers hop by hop along the path prescribed by a monitor.
The probe can follow the reverse path to return to the original monitor, thus
effectively probing any path with at least one end at a monitor.



Fig. 1. Sample network with three monitors: $m_1$, $m_2$, and $m_3$.


Discussion: In [?], “m-trail” (monitoring trails) is employed
as a probing mechanism in all-optical networks, where
measurement paths can contain repeated nodes but not repeated
links. It is unclear which routing protocols in communication
networks select paths under the restriction of “m-trails”;
we thus do not consider such a probing mechanism in this paper.
In [?], another probing mechanism, “m-tour” (monitoring
tours), is used, which allows both repeated nodes and repeated
links in measurement paths; “m-tour” is equivalent to CAP.

In this paper, we quantify how the flexibility of a probing
scheme affects the network’s capability to localize failures.
Although concrete results are only provided for the above
classes of probing mechanisms, our framework and our abstract
identifiability conditions (see Section III-A) can also be
used to evaluate the failure localization capabilities of other
probing mechanisms.


D. Objective 

Given a network topology $\mathcal{G}$, a set of monitors $M$, and a
probing mechanism (CAP, CSP, or UP), we seek to answer
the following closely related questions: (i) Given a node set
of interest $S$ and a bound $k$ on the number of failures, can
we uniquely localize up to $k$ failed nodes in $S$ from observed
path states? (ii) Given a node set $S$, what is the maximum
number of failures such that failures within $S$ can be uniquely localized?
(iii) Given an integer $k$ ($1 \le k \le \sigma$), what is the largest node
set that is $k$-identifiable? We will study these problems from
the perspectives of both theory and efficient algorithms.


E. Illustrative Example 

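Consider the sample network in Fig. 1 with three monitors
($m_1$ to $m_3$) and four non-monitors ($v_1$ to $v_4$). Under UP, suppose
that the default routing protocol only allows the monitors to
probe the following paths: $\mathcal{P}_1 = m_1 v_1 m_2$, $\mathcal{P}_2 = m_2 v_4 m_3$,
and $\mathcal{P}_3 = m_1 v_2 v_4 m_3$, which form a measurement matrix $\mathbf{R}^{\mathrm{UP}}$:

$$
\mathbf{R}^{\mathrm{UP}} =
\begin{array}{c|cccc}
 & W_1 & W_2 & W_3 & W_4 \\ \hline
\mathcal{P}_1 & 1 & 0 & 0 & 0 \\
\mathcal{P}_2 & 0 & 0 & 0 & 1 \\
\mathcal{P}_3 & 0 & 1 & 0 & 1
\end{array}
\tag{2}
$$

where $R^{\mathrm{UP}}_{ij} = 1$ if and only if node $v_j$ is on path $\mathcal{P}_i$. Then
we have $\mathbf{R}^{\mathrm{UP}} \odot \mathbf{w} = \mathbf{c}$, where $\mathbf{c}$ is the binary vector of path
states observed at the destination monitors. Let $S' := \{v_1, v_2, v_4\}$.
Based on Definition 2, we can verify that $\Omega(S') = 1$
(failure sets $\{v_4\}$ and $\{v_2, v_4\}$ affect the same paths, so $v_2$ is not 2-identifiable),
and the maximum identifiable set $S^*(1) = \{v_1, v_2, v_4\}$ and
$S^*(2) = S^*(3) = S^*(4) = \{v_1, v_4\}$. Under CSP, besides
the three paths in (2), we can probe three additional paths:
$\mathcal{P}_4 = m_2 v_3 m_3$, $\mathcal{P}_5 = m_1 v_2 v_3 m_3$, and $\mathcal{P}_6 = m_1 v_2 v_1 m_2$,
yielding an expanded measurement matrix in (3):

$$
\begin{array}{l|cccc|l}
 & W_1 & W_2 & W_3 & W_4 & \\ \hline
\mathcal{P}_1 = m_1 v_1 m_2 & 1 & 0 & 0 & 0 & \\
\mathcal{P}_2 = m_2 v_4 m_3 & 0 & 0 & 0 & 1 & \text{(rows } \mathcal{P}_1 \text{ to } \mathcal{P}_3 \text{: } \mathbf{R}^{\mathrm{UP}}\text{)} \\
\mathcal{P}_3 = m_1 v_2 v_4 m_3 & 0 & 1 & 0 & 1 & \\ \hline
\mathcal{P}_4 = m_2 v_3 m_3 & 0 & 0 & 1 & 0 & \\
\mathcal{P}_5 = m_1 v_2 v_3 m_3 & 0 & 1 & 1 & 0 & \\
\mathcal{P}_6 = m_1 v_2 v_1 m_2 & 1 & 1 & 0 & 0 &
\end{array}
\tag{3}
$$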

Using the six paths in (3), the maximum identifiability index
of $S'$ becomes $\Omega(S') = 3$, and the maximum identifiable set
is enlarged to $S^*(1) = S^*(2) = S^*(3) = \{v_1, v_2, v_3, v_4\}$ and
$S^*(4) = \{v_1, v_3, v_4\}$, a notable improvement over UP. Finally,
if CAP is supported, then we can send probes along a cycle
$\mathcal{P}_7 = m_1 v_2 m_1$. In conjunction with the paths in (3), this yields
the measurement matrix in (4):

$$
\begin{array}{l|cccc}
 & W_1 & W_2 & W_3 & W_4 \\ \hline
\mathcal{P}_1 = m_1 v_1 m_2 & 1 & 0 & 0 & 0 \\
\mathcal{P}_7 = m_1 v_2 m_1 & 0 & 1 & 0 & 0 \\
\mathcal{P}_4 = m_2 v_3 m_3 & 0 & 0 & 1 & 0 \\
\mathcal{P}_2 = m_2 v_4 m_3 & 0 & 0 & 0 & 1
\end{array}
\tag{4}
$$

Since the paths in (4) can independently determine the
state of each non-monitor, we have $\Omega(S') = 4$ and $S^*(1) =
S^*(2) = S^*(3) = S^*(4) = \{v_1, v_2, v_3, v_4\}$ under CAP, i.e.,
all failures can be uniquely localized.

This example shows that the monitor placement and the 
probing mechanism significantly affect a network’s capability 
to localize failures. In the rest of the paper, we will study this 
relationship both theoretically and algorithmically. 
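The values in this example can be checked by exhaustive enumeration directly from Definitions 1 to 3. The sketch below is a brute-force check, not the paper's polynomial-time method; all function names are hypothetical. It relies on one observation that follows from Definition 2: a set is $k$-identifiable exactly when each of its nodes is, because any indistinguishable pair of failure sets that disagrees inside $S$ disagrees on some single node of $S$. The measurement matrices are rebuilt from the paths listed in the example.

```python
from itertools import combinations

def affected_paths(R, F):
    """P_F: indices of measurement paths traversing at least one node in F."""
    return frozenset(i for i, row in enumerate(R) if any(row[j] for j in F))

def is_k_identifiable(R, S, k):
    """Definition 2-1), checked over all failure sets of size at most k."""
    sigma = len(R[0])
    S = frozenset(S)
    failure_sets = [frozenset(c) for size in range(k + 1)
                    for c in combinations(range(sigma), size)]
    for F1, F2 in combinations(failure_sets, 2):
        if F1 & S != F2 & S and affected_paths(R, F1) == affected_paths(R, F2):
            return False  # indistinguishable pair that disagrees inside S
    return True

def max_identifiability_index(R, S):
    """Omega(S): largest k <= sigma such that S is k-identifiable (0 if none)."""
    sigma = len(R[0])
    k = 0
    while k < sigma and is_k_identifiable(R, S, k + 1):
        k += 1
    return k

def max_identifiable_set(R, k):
    """S*(k), built as the set of nodes v for which {v} is k-identifiable."""
    return {v for v in range(len(R[0])) if is_k_identifiable(R, {v}, k)}

# Columns are v1..v4 (indices 0..3); rows follow the paths listed in the text.
R_UP = [[1, 0, 0, 0],            # P1 = m1 v1 m2
        [0, 0, 0, 1],            # P2 = m2 v4 m3
        [0, 1, 0, 1]]            # P3 = m1 v2 v4 m3
R_CSP = R_UP + [[0, 0, 1, 0],    # P4 = m2 v3 m3
                [0, 1, 1, 0],    # P5 = m1 v2 v3 m3
                [1, 1, 0, 0]]    # P6 = m1 v2 v1 m2
R_CAP = [[1, 0, 0, 0],           # P1
         [0, 1, 0, 0],           # P7 = m1 v2 m1
         [0, 0, 1, 0],           # P4
         [0, 0, 0, 1]]           # P2

S_prime = {0, 1, 3}              # S' = {v1, v2, v4}
for name, R in [("UP", R_UP), ("CSP", R_CSP), ("CAP", R_CAP)]:
    index = max_identifiability_index(R, S_prime)
    sets = [sorted(max_identifiable_set(R, k)) for k in range(1, 5)]
    print(name, index, sets)
```

The enumeration grows exponentially with the number of non-monitors, which is why it is only practical on toy networks such as this one. Under UP it returns an index of 1 for $S'$, in line with $S^*(2) = \{v_1, v_4\}$ excluding $v_2$; under CSP and CAP it returns 3 and 4.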


III. Theoretical Foundations 

We start with some basic understanding of failure identifiability.
First, the definition of $k$-identifiability in Definition 2
requires enumeration of all possible failure events and thus
cannot be tested efficiently. To address this issue, we establish
explicit sufficient/necessary conditions for $k$-identifiability that
apply to arbitrary probing mechanisms, which will later be
developed into verifiable conditions for the three classes of
probing mechanisms. Moreover, we establish several desirable
properties of the maximum identifiability index (Definition 2)
and the maximum identifiable set (Definition 3), which greatly
simplify the computation of these measures.


A. Abstract Identifiability Conditions

Our identifiability condition is inspired by a result known
in a related field called combinatorial group testing ll 2 ^ .
In short, group testing aims to find abnormal elements in a
given set by running tests on subsets of elements, each test
indicating whether any element in the subset is abnormal.
This is analogous to our problem, where abnormal elements are
failed nodes and tests are conducted by probing measurement
paths. A subtle but critical difference is that in our problem, the
subsets of elements that can be tested together are constrained
by the set of measurement paths $P$, which is in turn limited by
the topology, probing mechanism, and placement of monitors.⁴

Most existing solutions for (nonadaptive) group testing aim
at constructing a disjunct testing matrix. Specifically, a testing
matrix $R$ is a binary matrix, where $R_{ij} = 1$ if and only if
element $j$ is included in the $i$-th test. Matrix $R$ is $k$-disjunct
if the Boolean sum of any $k$ columns does not “contain” any
other column⁵ [25]. In our problem, the existence of a disjunct
testing matrix translates into the following conditions.

Lemma 4. Set $S$ is $k$-identifiable:

a) if for any failure set $F$ with $|F| \le k$ and any node $v$ with
$v \in S \setminus F$, $\exists\, p \in P$ traversing $v$ but none of the nodes
in $F$;

b) only if for any failure set $F$ with $|F| \le k - 1$ and any
node $v$ with $v \in S \setminus F$, $\exists\, p \in P$ traversing $v$ but none
of the nodes in $F$.

Proof: Consider two distinct failure sets $F_1$ and $F_2$ with
$F_1 \cap S \ne F_2 \cap S$, each containing no more than $k$ nodes. There
exists a node $v \in S$ in only one of these sets; suppose $v \in F_1 \setminus
F_2$. By the condition in the lemma, $\exists$ a path $p$ traversing $v$ but
not $F_2$, thus distinguishing $F_1$ from $F_2$. Therefore, condition
a) in Lemma 4 is sufficient.

Suppose $\exists$ a non-empty set $F$ with $|F| \le k - 1$ and $v \in S \setminus F$
such that all measurement paths traversing $v$ must also traverse
at least one node in $F$. Then the two failure sets $F$ and
$F \cup \{v\}$, which satisfy conditions (1-2) in Definition 2(1), are
not distinguishable as $P_F = P_{F \cup \{v\}}$. Thus, condition b) in
Lemma 4 is necessary. ■

These conditions generally apply to any probing mechanism.
Although in their current form they do not directly lead
to efficient testing algorithms, we will show later (Section IV)
that they can be transformed into verifiable conditions for
several classes of probing mechanisms.


B. Properties of the Maximum Identifiability Index and the
Maximum Identifiable Set

Although the maximum identifiability index $\Omega(S)$ and the
maximum $k$-identifiable set $S^*(k)$ are defined for sets of
nodes, we show below that they can both be characterized
in terms of a per-node property, which greatly simplifies the
computation of these measures. We start with the following
two observations.


Lemma 5. a) If $S$ is $k$-identifiable, then any $v \in S$ must be
$k$-identifiable.

b) If $v$ is $k$-identifiable $\forall v \in S$, then $S$ is $k$-identifiable.

Proof: a) Suppose $\exists$ node $v \in S$ that is not $k$-identifiable;
then $\exists$ at least two failure sets $F_1$ and $F_2$ with $|F_i| \le k$
($i \in \{1, 2\}$) and $F_1 \cap \{v\} \ne F_2 \cap \{v\}$ such that $F_1$ and $F_2$ are
not distinguishable. Thus, $S$ is not $k$-identifiable as $v \in S$.

b) For any two failure sets $F_1$ and $F_2$ with $|F_i| \le k$ ($i \in \{1,
2\}$) and $F_1 \cap S \ne F_2 \cap S$, $\exists$ a node $v \in S$ that is either in $F_1$
or $F_2$ but not both. Since node $v$ is $k$-identifiable, $F_1$ and $F_2$
must be distinguishable. Therefore, $S$ is $k$-identifiable. ■

Proposition 6. $\Omega(S) = \min_{v \in S} \Omega(v)$.


⁴In this regard, our problem is similar to a variation of group testing under
graph constraints (U; see Section I-A for the difference.


⁵That is, for any subset of $k$ column indices $S$ and any other column index
$j \notin S$, $\exists$ a row index $i$ such that $R_{ij} = 1$ and $R_{ij'} = 0$ for all $j' \in S$.
Proof: By Lemma 5(a), any $v \in S$ must have $\Omega(v) \ge
\Omega(S)$. Thus, $\min_{v \in S} \Omega(v) \ge \Omega(S)$. By the definition of
maximum identifiability index, all nodes in $S$ are
$\min_{v \in S} \Omega(v)$-identifiable. By Lemma 5(b), $S$ is also
$\min_{v \in S} \Omega(v)$-identifiable. Thus, $\Omega(S) \ge \min_{v \in S} \Omega(v)$.
Therefore, $\Omega(S) = \min_{v \in S} \Omega(v)$. ■

Corollary 7. The maximum identifiability index of $S$, $\Omega(S)$, is
monotonically non-increasing in the sense that $\Omega(S_1) \ge
\Omega(S_2)$ for any two non-empty sets $S_1$ and $S_2$ with $S_1 \subseteq S_2$.

Proof: Since $S_1 \subseteq S_2$, $\min_{v \in S_1} \Omega(v) \ge \min_{v \in S_2} \Omega(v)$.
Therefore, by Proposition 6, $\Omega(S_1) \ge \Omega(S_2)$. ■

Therefore, we can estimate the maximum identifiability
index of a given non-monitor set using Corollary 7 when the
maximum identifiability index of its subset/superset is known.
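
As a small illustration of Proposition 6 and Corollary 7, the sketch below takes per-node indices as given and derives set-level indices from them. The per-node values are those implied by the six-path example of Section II (node $v_2$ has index 3, the others 4); the function name is an assumption made for this illustration.

```python
def set_index(per_node_index, S):
    """Omega(S) as the minimum per-node index over S (Proposition 6)."""
    return min(per_node_index[v] for v in S)

# Per-node indices implied by the six-path example (S*(4) excludes only v2).
omega = {"v1": 4, "v2": 3, "v3": 4, "v4": 4}

small = {"v1", "v3"}
large = {"v1", "v2", "v3"}   # superset of `small`

omega_small = set_index(omega, small)
omega_large = set_index(omega, large)
```

Enlarging the set can only keep or lower the index, which is the monotonicity stated in Corollary 7.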

Next, we show that maximum $k$-identifiable sets exhibit
properties that can facilitate fast determination of which nodes
should be included/excluded in these sets.

Proposition 8. Let $S'(k) := \{v \in N : v \text{ is } k\text{-identifiable}\}$.
Then $S'(k) = S^*(k)$.

Proof: By Lemma 5(a), any node in $S^*(k)$ is $k$-identifiable.
Therefore, $S^*(k) \subseteq S'(k)$.

Next, $S'(k)$ must be $k$-identifiable according to Lemma 5(b).
Thus $|S'(k)| \le |S^*(k)|$. Consequently, $S'(k) = S^*(k)$. ■

Proposition 8 provides a method to construct the maximum
$k$-identifiable set $S^*(k)$ by simply collecting all $k$-identifiable
nodes. Based on this method, we can further prove the
uniqueness and monotonicity of $S^*(k)$ as follows:

Corollary 9. The maximum $k$-identifiable set $S^*(k)$ is unique
and monotonically non-increasing in $k$, i.e., $S^*(k+1) \subseteq
S^*(k)$ for any $k$.

Proof: Definition 2 implies that $k$-identifiability is a per-node
property that is independent of the identifiability of other
nodes. Therefore, each node in $N$ is either $k$-identifiable
or not $k$-identifiable. By Proposition 8, $S^*(k)$ is the set containing
all $k$-identifiable nodes; therefore, $S^*(k)$ is unique.

For each node $w \in N \setminus S^*(k)$, $w$ is not $k$-identifiable, and
thus $w$ is not $(k+1)$-identifiable. Since $S^*(k+1)$ is the collection
of all $(k+1)$-identifiable nodes, no nodes in $N \setminus S^*(k)$ can
be included in $S^*(k+1)$. Thus, $S^*(k+1) \subseteq S^*(k)$. ■

Intuitively, if there exists a $k$-identifiable set $S''(k)$ with
$|S''(k)| = |S^*(k)|$, then we must have $S''(k) = S^*(k)$. Thus,
Corollary 9 suggests that one way to obtain $S^*(k)$ is to identify
$S^*(j)$ for $j < k$ and then only study subsets of $S^*(j)$; nodes
outside $S^*(j)$ are guaranteed to be excluded from $S^*(k)$.

Corollary 10. Let $S''(k) := \{v \in N :$ for each failure set $F$ with
$v \notin F$ and $|F| \le k$, $\exists$ a path in $P$ traversing $v$ but none of
the nodes in $F\}$. Then $S''(k) \subseteq S^*(k)$.

Proof: $S''(k)$ satisfies sufficient condition a) in Lemma 4.
Thus, $\Omega(S''(k)) \ge k$. Following similar arguments as in the
proof of Proposition 8, we again have that each node in $S''(k)$
is at least $k$-identifiable. Therefore, $S''(k) \subseteq S^*(k)$. ■

By Corollary 10, we note that $S''(k)$ underestimates the
size of the maximum $k$-identifiable set $S^*(k)$, yet it forms an
inner bound (i.e., subset) of $S^*(k)$, thus providing theoretical
support for determining the must-have nodes in the optimum
set $S^*(k)$; see detailed discussions presented in Section VI.
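
The inner bound $S''(k)$ can be computed directly from its definition on small instances. The sketch below does so by enumerating failure sets, again representing each path by the non-monitors it traverses (the six paths of the Section II example, with nodes 1 to 4 standing for $v_1$ to $v_4$). The function name is hypothetical, and the enumeration is exponential, so this is an illustration rather than an efficient test.

```python
from itertools import combinations

def inner_bound_set(paths, nodes, k):
    """Nodes v for which every failure set F (v not in F, |F| <= k) leaves at
    least one path through v untouched, i.e., condition a) of Lemma 4."""
    result = set()
    for v in nodes:
        others = [u for u in nodes if u != v]
        through_v = [p for p in paths if v in p]
        survives_all = all(
            any(not (p & set(F)) for p in through_v)
            for r in range(k + 1)
            for F in combinations(others, r)
        )
        if survives_all:
            result.add(v)
    return result

nodes = [1, 2, 3, 4]
six_paths = [frozenset(p) for p in ({1}, {4}, {2, 4}, {3}, {2, 3}, {1, 2})]

inner_2 = inner_bound_set(six_paths, nodes, 2)
inner_3 = inner_bound_set(six_paths, nodes, 3)
```

For $k = 3$ the bound drops node 2 even though the example reports $S^*(3) = \{v_1, v_2, v_3, v_4\}$, which shows that the inner bound can be strict.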

Remark: Results in this section apply to any probing mechanism.
We will show in the following sections how they can be
used to design efficient algorithms for probing mechanisms
CAP, CSP, and UP. The above results can also be used to
design algorithms for other probing mechanisms.

IV. Verifiable Identifiability Conditions 

In this section, starting from the abstract conditions in
Section III-A, we develop concrete conditions suitable for
efficient testing for the three classes of probing mechanisms.

A. Conditions under CAP 

CAP essentially allows us to “ping” any node from a monitor
along any path. In the face of failures, this allows a monitor
to determine the state of a node as long as it is connected to
the node after removing other failed nodes. This observation
allows us to translate the conditions in Section III-A into more
concrete identifiability conditions (Lemma 11).

Lemma 11. Set $S$ is $k$-identifiable under CAP if and only if
for any set $V'$ of up to $k - 1$ non-monitors, each connected
component in $\mathcal{G} - V'$ that contains a node in $S$ has a monitor.

Proof: Necessity. Suppose the above condition does not
hold, i.e., there exists a non-monitor $v$ ($v \in S$) that is
disconnected from all monitors in $\mathcal{G} - V'$ for a set $V'$ of
up to $k - 1$ non-monitors ($v \notin V'$). Then if nodes in $V'$ fail,
no remaining measurement path can probe $v$, violating the
condition in Lemma 4(b).

Sufficiency. The proof is similar to that of Theorem 2 in 
na, except that we are only interested in localizing failures 
in $S$. Consider two failure sets $F_1$ and $F_2$ with $|F_i| \le k$
($i \in \{1, 2\}$) and $F_1 \cap S \ne F_2 \cap S$. Then $\exists$ node $v$ ($v \in S$)
that is in one and only one of $F_1$ and $F_2$. Without loss of
generality, let $v \in F_1$. Let $I := F_1 \cap F_2$. Since $|I| \le k - 1$, $\exists$
a path $p$ connecting a monitor $m$ with node $v$ in $\mathcal{G} - I$ if the
condition in Lemma 11 holds. Let $w$ be the first node on $p$
(starting from $m$) that is in either $F_1 \setminus I$ or $F_2 \setminus I$. Truncating
$p$ at $w$ gives a path $p'$ such that $p'$ and its reverse path form a
measurement path from $m$ to $w$ and back to $m$ that traverses
only $F_1$ or $F_2$, thus distinguishing $F_1$ and $F_2$. ■

Under CAP, Lemma 11 shows that the necessary condition
derived from Lemma 4 is also sufficient. However, the condition
in Lemma 11 still cannot be tested efficiently because a
combinatorial number of sets $V'$ are enumerated. Fortunately,
we can reduce Lemma 11 into explicit conditions on vertex-cuts
of a related topology, which can then be tested in polynomial
time. We use the following notion from graph theory.

Definition 12. For two nodes $s$ and $t$ in an undirected graph
$\mathcal{G}$, the $(s, t)$-vertex-cut in $\mathcal{G}$, denoted by $C_{\mathcal{G}}(s, t)$, is the
minimum-cardinality node set whose deletion destroys all paths from $s$
to $t$. If $s$ and $t$ are neighbors, $C_{\mathcal{G}}(s, t) := V(\mathcal{G}) \setminus \{t\}$.

Our key observation is that requiring each connected component
in $\mathcal{G} - V'$ that contains a node in $S$ to have a
monitor is equivalent to requiring each such component in
$\mathcal{G} - M - V'$ (i.e., after removing all monitors) to contain
a neighbor of a monitor. Thus, if we extend $\mathcal{G} - M$ by
adding a virtual monitor $m'$ and virtual links connecting $m'$
and all neighbors of monitors to obtain an auxiliary graph
$\mathcal{G}^* := \mathcal{G} - M + \{m'\} + \mathcal{L}(\{m'\}, \mathcal{N}(M))$ (illustrated in
Fig. 2(b)), then each node in $S \setminus V'$ should be connected to $m'$
in $\mathcal{G}^* - V'$. In other words, the minimum cardinality of the
$(m', w)$-vertex-cut in $\mathcal{G}^*$ over all $w \in S$ must be greater than $|V'|$.
For ease of presentation, we introduce the following definition.

Fig. 2. Auxiliary graphs: (a) Original graph $\mathcal{G}$; (b) $\mathcal{G}^*$ of $\mathcal{G}$;
(c) $\mathcal{G}_{m_1}$ of $\mathcal{G}$ w.r.t. monitor $m_1$.

Definition 13. Given a graph $\mathcal{G}$, a node set $S$, and a node
$m \notin S$, define $\Gamma_{\mathcal{G}}(S, m) := \min_{w \in S} |C_{\mathcal{G}}(w, m)|$.

By this definition, Lemma 11 can be transformed into a
new condition, which reduces the tests over all possible $V'$ to
a single test of the vertex-cuts of $\mathcal{G}^*$, as stated below (recall
that $\sigma$ is the total number of non-monitors).

Lemma 14. Each connected component in $\mathcal{G} - V'$ that
contains a node in $S$ has a monitor for any set $V'$ of up to $q$
($q \le \sigma - 1$) non-monitors if and only if $\Gamma_{\mathcal{G}^*}(S, m') \ge q + 1$.

Proof: The proof can be found in [26]. ■

Lemma 14 allows us to rewrite the identifiability conditions
in Lemma 11 in terms of the vertex-cuts of $\mathcal{G}^*$.

Theorem 15 ($k$-identifiability under CAP). Set $S$ is $k$-identifiable
($k \le \sigma$) under CAP if and only if $\Gamma_{\mathcal{G}^*}(S, m') \ge k$.

A special case of Theorem 15 occurs when $k = \sigma$, i.e., any
non-monitors can fail simultaneously. In this case, each node
in $S$ must directly connect to at least one monitor in $\mathcal{G}$.

Discussion: Theorem 15 extends and improves the identifiability
condition given in Theorem 2 of m by (i) considering
failures within an arbitrary subset of nodes instead of the entire
network, and (ii) providing a single condition that can be tested
in polynomial time (see testing algorithm below) instead of
testing a combinatorial number of conditions that enumerate
all possible failure events.

Testing algorithm: A key advantage of the newly derived
conditions over the abstract conditions in Section III-A is that
they can be tested efficiently. Let $\theta := |\mathcal{N}(M)|$ denote the
number of non-monitors that are neighbors of at least one
monitor in $M$. Given node $w$, $C_{\mathcal{G}^*}(w, m')$ can be computed
in $O(\theta\xi)$ time⁶, where $\xi$ is the number of links (refer to
Table I for notations). Therefore, we can evaluate $\Gamma_{\mathcal{G}^*}(S, m')$
in $O(\theta\xi|S|)$ time and compare the result with $k$ to test the
conditions in Theorem 15.

⁶The $(m', w)$-vertex-cut problem in an undirected graph can be reduced
to an $(m', w)$-edge-cut problem in a directed graph in linear time [27]. The
$(m', w)$-edge-cut problem is solvable by the Ford-Fulkerson algorithm [28]
in $O(\theta\xi)$ time.
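
The test in Theorem 15 can be sketched in a few dozen lines. The code below is a minimal illustration, not the authors' implementation: it builds the auxiliary graph $\mathcal{G}^*$, then computes each vertex-cut size by splitting nodes and running a unit-augmenting max-flow (a simple BFS variant of Ford-Fulkerson). The topology, node names, and function names are hypothetical, and adjacent node pairs follow the convention of Definition 12.

```python
from collections import deque

def vertex_cut_size(adj, s, t):
    """Size of a minimum (s, t)-vertex-cut in an undirected graph given as
    {node: set_of_neighbors}. Adjacent s and t yield |V| - 1 (Definition 12)."""
    if t in adj[s]:
        return len(adj) - 1
    big = len(adj)  # larger than any cut, so it behaves as unbounded capacity
    cap = {}
    def add_arc(a, b, c):
        cap.setdefault(a, {})[b] = c
        cap.setdefault(b, {}).setdefault(a, 0)  # residual arc
    for u in adj:
        # Split u into u_in -> u_out; unit capacity lets u be "cut" once.
        add_arc((u, "in"), (u, "out"), big if u in (s, t) else 1)
        for w in adj[u]:
            add_arc((u, "out"), (w, "in"), big)
    source, sink = (s, "out"), (t, "in")
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            a = queue.popleft()
            for b, c in cap[a].items():
                if c > 0 and b not in parent:
                    parent[b] = a
                    queue.append(b)
        if sink not in parent:
            return flow  # max flow equals the minimum vertex cut
        b = sink
        while parent[b] is not None:  # push one unit along the augmenting path
            a = parent[b]
            cap[a][b] -= 1
            cap[b][a] += 1
            b = a
        flow += 1

def auxiliary_graph(adj, monitors, virtual="m'"):
    """G* = G - M + {m'} + links from m' to every non-monitor neighbor of a monitor."""
    g = {u: {w for w in nbrs if w not in monitors}
         for u, nbrs in adj.items() if u not in monitors}
    g[virtual] = set()
    for m in monitors:
        for u in adj[m]:
            if u not in monitors:
                g[virtual].add(u)
                g[u].add(virtual)
    return g

# Hypothetical topology (not from the paper): one monitor m1, non-monitors a-d.
edges = [("m1", "a"), ("m1", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = {}
for x, y in edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

g_star = auxiliary_graph(adj, {"m1"})
per_node = {v: vertex_cut_size(g_star, v, "m'") for v in "abcd"}
gamma_cd = min(per_node[v] for v in ("c", "d"))  # Gamma_{G*}({c, d}, m')
```

In this toy graph, node d hangs off node c, so a single failure (of c) can hide it; by Theorem 15 the set {c, d} is therefore 1-identifiable but not 2-identifiable under CAP.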

B. Conditions under CSP 

Under CSP, we restrict measurement paths $P$ to the set of
simple paths between monitors, i.e., paths starting/ending at
distinct monitors that contain no cycles. As in CAP, our goal is
again to transform the abstract conditions in Section III-A into
concrete sufficient/necessary conditions that can be efficiently
verified. We first give an analogous result to Lemma 11.

Lemma 16. Set $S$ is $k$-identifiable under CSP:

a) if for any node set $V'$, $|V'| \le k + 1$, containing at most
one monitor, each connected component in $\mathcal{G} - V'$ that
contains a node in $S$ also contains a monitor;

b) only if for any node set $V'$, $|V'| \le k$, containing at most
one monitor, each connected component in $\mathcal{G} - V'$ that
contains a node in $S$ also contains a monitor.

Proof: The proof can be found in [26]. ■
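
For intuition, the two conditions of Lemma 16 can be checked by brute force on a small graph. The sketch below enumerates every node set $V'$ with at most one monitor and inspects the connected components that remain. The 7-node topology is hypothetical (three monitors, four non-monitors, with links assumed for illustration), the function names are not from the paper, and the enumeration is exponential, so it does not replace an efficient test.

```python
from itertools import combinations

def components_ok(adj, monitors, S, removed):
    """True if every connected component of G - removed that contains a node
    of S also contains a monitor."""
    seen = set(removed)
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            component.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if component & S and not component & monitors:
            return False
    return True

def lemma16_condition(adj, monitors, S, max_size):
    """Check all node sets V' with |V'| <= max_size and at most one monitor."""
    for r in range(max_size + 1):
        for removed in combinations(list(adj), r):
            if len(monitors.intersection(removed)) > 1:
                continue
            if not components_ok(adj, monitors, S, removed):
                return False
    return True

# Hypothetical topology, assumed for illustration only.
edges = [("m1", "v1"), ("v1", "m2"), ("m2", "v4"), ("v4", "m3"), ("m1", "v2"),
         ("v2", "v4"), ("m2", "v3"), ("v3", "m3"), ("v2", "v3"), ("v2", "v1")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
monitors = {"m1", "m2", "m3"}
S = {"v1", "v2", "v3", "v4"}

sufficient_k2 = lemma16_condition(adj, monitors, S, 2 + 1)  # condition a) with k = 2
necessary_k4 = lemma16_condition(adj, monitors, S, 4)       # condition b) with k = 4
```

Here condition a) holds for $k = 2$, while condition b) fails for $k = 4$ because removing m1, v1, v3, and v4 isolates v2.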

Due to the restriction to simple paths, the identifiability
conditions in Lemma 16 are stronger than those in Lemma 11. As
with Lemma 11, the conditions in Lemma 16 do not directly
lead to efficient tests, and we again seek equivalent conditions
in terms of topological properties. Each condition in the form
of Lemma 16(a-b) covers two cases: (i) $V'$ only contains
non-monitors; (ii) $V'$ contains a monitor and $|V'| - 1$ non-monitors.
The first case has been converted to a vertex-cut property on
an auxiliary topology $\mathcal{G}^*$ by Lemma 14; we now establish a
similar condition for the second case using similar arguments.

Fix a set $V' = F \cup \{m\}$, where $m$ is a monitor in $M$
and $F$ a set of non-monitors. Again, the key observation is
that each connected component in $\mathcal{G} - V'$ that contains a
node in $S$ also containing a monitor is equivalent to each
such component in $\mathcal{G} - M - F$ containing a neighbor of
a monitor other than $m$ (i.e., a node in $\mathcal{N}(M \setminus \{m\})$). To
capture this observation, we introduce another auxiliary graph
$\mathcal{G}_m := \mathcal{G} - M + \{m'\} + \mathcal{L}(\{m'\}, \mathcal{N}(M \setminus \{m\}))$ w.r.t. monitor
$m$ as illustrated in Fig. 2(c), where $m'$ is again a virtual
monitor. We will show that the second case ($V'$ contains a
monitor) is equivalent to requiring that the nodes in $S \setminus F$ and
$m'$ are in the same connected component within $\mathcal{G}_m - F$, and
thus the following holds.

Lemma 17. The following two conditions are equivalent:

(1) Each connected component in $\mathcal{G} - V'$ that contains a node
in $S$ also contains a monitor for all sets $V'$ containing
monitor $m$ ($m \in M$) and up to $q$ ($q \le \sigma - 1$)
non-monitors;

(2) $\Gamma_{\mathcal{G}_m}(S, m') \ge q + 1$.

Proof: The proof can be found in [26]. ■

Based on Lemmas 14 and 17, we can rewrite Lemma 16 as
follows.

Theorem 18 ($k$-identifiability under CSP). Set $S$ is $k$-identifiable
under CSP:

a) if $\Gamma_{\mathcal{G}^*}(S, m') \ge k + 2$, and $\min_{m \in M} \Gamma_{\mathcal{G}_m}(S, m') \ge k + 1$
($k \le \sigma - 2$);

b) only if $\Gamma_{\mathcal{G}^*}(S, m') \ge k + 1$, and $\min_{m \in M} \Gamma_{\mathcal{G}_m}(S, m') \ge k$
($k \le \sigma - 1$).

Theorem 18 does not include the cases of $k = \sigma$ and
$k = \sigma - 1$, which are addressed in Propositions 19 and 20.

Proposition 19. Set $S$ is $\sigma$-identifiable under CSP if and only
if each node in $S$ has at least two monitors as neighbors.

Proof: The proof can be found in [26]. ■

Proposition 20. Set $S$ is $(\sigma - 1)$-identifiable under CSP if
and only if (i) all nodes in $S$ have at least two monitors as
neighbors, or (ii) all nodes in $N \setminus \{v\}$ ($v \in S$) have at least
two monitors as neighbors and $v$ has all nodes in $N \setminus \{v\}$
and one monitor as neighbors.

Proof: The proof can be found in [26]. ■

Testing algorithm: Similar to the case of CAP, we can use
the algorithm in [27], [28] to compute the vertex-cuts of the
auxiliary graphs $\mathcal{G}^*$ and $\mathcal{G}_m$ ($\forall m \in M$), and test the conditions
in Theorem 18 for any given $k$. The overall complexity of the
test is $O(\mu\theta\xi|S|)$ (refer to Table I for notations).

C. Conditions under UP 

Under UP, monitors have no control over the probing paths
between monitors, and the set of measurement paths $P$ is
limited to the paths between monitors determined by the
network’s native routing protocol. In contrast to the previous
cases (CAP, CSP), identifiability under UP can no longer
be characterized in terms of topological properties. We can,
nevertheless, establish explicit conditions based on the abstract
conditions in Section III-A. The idea is to examine how
many non-monitors need to be removed to disconnect all
measurement paths traversing a given non-monitor $v$. If the
number is sufficiently large (greater than $k$), then we can still
infer the state of $v$ from some measurement path when a set
of other non-monitors fail; if the number is too small (smaller
than or equal to $k - 1$), then we are not able to determine the
state of $v$ as the failures of all paths traversing $v$ can already be
explained by the failures of other non-monitors. This intuition
leads to the following results.

In the sequel, $P_v \subseteq P$ denotes the set of measurement
paths traversing a non-monitor $v$, and $\mathcal{C}_v := \{P_w : w \in N,
w \ne v\}$ denotes the collection of path sets traversing
non-monitors in $N \setminus \{v\}$. We use $\mathrm{MSC}(v)$ to denote the size of
the minimum set cover of $P_v$ by $\mathcal{C}_v$, i.e., $\mathrm{MSC}(v) := |V'|$ for
the minimum set $V' \subseteq N \setminus \{v\}$ such that $P_v \subseteq \bigcup_{w \in V'} P_w$.

Note that covering is only feasible if $v$ is not on any 2-hop
measurement path (i.e., monitor-$v$-monitor), in which case we
know $P_v \subseteq \bigcup_{w \in N, w \ne v} P_w$ and thus $\mathrm{MSC}(v) \le \sigma - 1$. If $v$ is
on a 2-hop path, then we define $\mathrm{MSC}(v) := \sigma$.

Theorem 21 ($k$-identifiability under UP). Set $S$ is $k$-identifiable
under UP with measurement paths $P$:

a) if $\mathrm{MSC}(v) \ge k + 1$ for any node $v$ in $S$ ($k \le \sigma - 1$);

b) only if $\mathrm{MSC}(v) \ge k$ for any node $v$ in $S$ ($k \le \sigma$).

Proof: The proof can be found in [26]. ■

The only case not considered by Theorem 21 is the case
that $k = \sigma$, for which we develop the following condition.


Proposition 22. Set $S$ is $\sigma$-identifiable under UP if and only
if $\mathrm{MSC}(v) = \sigma$ for any node $v$ in $S$, i.e., each node in $S$ is
on a 2-hop path.

Proof: The proof can be found in [26]. ■

Testing algorithm: The conditions in Theorem 21 provide
an explicit way to test $k$-identifiability under UP, using tests
of the form $\mathrm{MSC}(v) \ge q$. Unfortunately, evaluating such a
test, known as the decision problem of the set covering problem,
is known to be NP-complete. Nevertheless, we can use
approximation algorithms to compute bounds on $\mathrm{MSC}(v)$. An
algorithm with the best approximation guarantee is the greedy
algorithm, which iteratively selects the set in $\mathcal{C}_v$ that contains
the largest number of uncovered paths in $P_v$ until all the paths
in $P_v$ are covered (assuming that $v$ is not on any 2-hop path).

Let $\mathrm{GSC}(v)$ denote the number of sets selected by the
greedy algorithm. This immediately provides an upper bound:
$\mathrm{MSC}(v) \le \mathrm{GSC}(v)$. Moreover, since the greedy algorithm has
an approximation ratio of $\log(|P_v|) + 1$ [29], we can also bound
$\mathrm{MSC}(v)$ from below: $\mathrm{MSC}(v) \ge \mathrm{GSC}(v) / (\log(|P_v|) + 1)$.
Applying these bounds to Theorem 21 yields relaxed conditions:

• $S$ is $k$-identifiable under UP if $k < \left\lceil \min_{v \in S} \frac{\mathrm{GSC}(v)}{\log(|P_v|) + 1} \right\rceil$;

• $S$ is not $k$-identifiable under UP if $k > \min_{v \in S} \mathrm{GSC}(v)$.

These conditions can be tested by running the greedy
algorithm for all nodes in $S$, each taking time
$O(|P_v|^2 \sigma) = O(|P|^2 \sigma)$, and the overall test has a complexity
of $O(|S||P|^2 \sigma)$ (or $O(\mu^4 \sigma |S|)$ if there is a measurement path
between each pair of monitors).
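
The greedy procedure and the resulting bounds can be sketched as follows. This is an illustrative implementation under stated assumptions: each path is represented by the set of non-monitors it traverses (so a 2-hop path is a singleton), the logarithm is taken to be the natural logarithm, the six paths of the Section II example are reused as a hypothetical set of routing-determined paths, and the function names are not from the paper.

```python
import math

def greedy_set_cover_size(paths, nodes, v):
    """GSC(v): number of sets P_w (w != v) picked greedily to cover P_v."""
    P = {u: {i for i, p in enumerate(paths) if u in p} for u in nodes}
    uncovered, picked = set(P[v]), 0
    while uncovered:
        best = max((u for u in nodes if u != v), key=lambda u: len(P[u] & uncovered))
        if not P[best] & uncovered:
            raise ValueError("P_v cannot be covered (v lies on a 2-hop path)")
        uncovered -= P[best]
        picked += 1
    return picked

def up_index_bounds(paths, nodes, v):
    """(lower, upper) bounds on the per-node index under UP via the greedy cover."""
    p_v = [p for p in paths if v in p]
    if not p_v:
        return 0, 0                      # v is on no path: not even 1-identifiable
    if any(p == {v} for p in p_v):
        return len(nodes), len(nodes)    # v on a 2-hop path (Proposition 22)
    gsc = greedy_set_cover_size(paths, nodes, v)
    lower = math.ceil(gsc / (math.log(len(p_v)) + 1)) - 1
    return max(lower, 0), gsc

nodes = [1, 2, 3, 4]
paths = [frozenset(p) for p in ({1}, {4}, {2, 4}, {3}, {2, 3}, {1, 2})]

gsc_v2 = greedy_set_cover_size(paths, nodes, 2)
bounds_v2 = up_index_bounds(paths, nodes, 2)
bounds_v1 = up_index_bounds(paths, nodes, 1)
```

For node 2 the three paths through it are each covered by a different neighbor, so the greedy cover needs three sets and the bounds bracket the true per-node index of 3.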

D. Special Case: 1-identifiability 

In practice, the most common failure event consists of
the failure of a single node. Thus, an interesting question is
whether $S$ is 1-identifiable under a given monitor placement
and a given probing mechanism. In our previous results,
Theorems 18 and 21 only provide an answer to the above
question if the sufficient condition is satisfied or the necessary
condition is violated for $k = 1$; however, the answer is
unknown if $S$ satisfies the necessary condition but violates
the sufficient condition under CSP and UP. In contrast,
Theorem 15 establishes a condition under CAP that is both
necessary and sufficient, yet still expressed in a complicated
form (i.e., vertex-cuts). We develop explicit methods below
for testing $S$ for 1-identifiability.

1) Conditions for 1-identifiability: We start with a generic
necessary and sufficient condition that applies to all probing
mechanisms. Recall that $P_v$ denotes the set of measurement
paths traversing a non-monitor $v$. For $k = 1$, Definition 2(1)
is equivalent to the following:

Claim 23. $S$ is 1-identifiable if and only if:

(1) $P_v \ne \emptyset$ for any $v \in S$, and

(2) $P_v \ne P_w$ for any $v \in S$, $w \in N$, and $v \ne w$.

In Claim 23, the first condition guarantees that any failure
in $S$ is detectable (i.e., causing at least one path failure), and
the second condition guarantees that the observed path states
can uniquely localize the failed node in $S$. An efficient test
of these conditions, however, requires different strategies for
different probing mechanisms.
Fig. 3. Extended graph $\mathcal{G}'$.

2) Test under CAP: By Theorem 15, $S$ is 1-identifiable
under CAP if and only if $\Gamma_{\mathcal{G}^*}(S, m') \ge 1$. This is equivalent
to requiring that $\mathcal{G}^*$ be connected, i.e., $\mathcal{G}$ has at least one monitor.

Testing for 1-identifiability of $S$ under CAP is therefore
reduced to determining if the network has a monitor.

3) Test under CSP: Under CSP, we derive conditions that
are equivalent to those in Claim 23 but easier to test.

Condition (1) in Claim 23 requires that every non-monitor
in $S$ reside on a monitor-monitor simple path. While an
exhaustive search for such a path incurs an exponential cost,
we can test for its existence efficiently using the following
observation. The idea is to construct an extended graph
$\mathcal{G}' := \mathcal{G} + \{m'\} + \mathcal{L}(\{m'\}, M)$, i.e., by adding a virtual
monitor $m'$ and connecting it to all the monitors; see an
illustration in Fig. 3. We claim that a non-monitor $v$ is on
a monitor-monitor simple path if and only if the size of the
$(m', v)$-vertex-cut in $\mathcal{G}'$ is at least two, i.e., $\Gamma_{\mathcal{G}'}(v, m') \ge 2$,
which implies the existence (see Definition 12) of two
vertex-independent simple paths between $v$ and $m'$, illustrated as
paths $v m_2 m'$ and $v m_1 m'$ in Fig. 3. Truncating these two
paths at $m_2$ and $m_1$ yields two path segments $v m_2$ and $v m_1$,
whose concatenation gives a monitor-to-monitor simple path
traversing $v$, i.e., $m_2 v m_1$ in Fig. 3. On the other hand, if
$\exists$ a monitor-to-monitor simple path traversing $v$, then it can
be split into two simple paths connecting $v$ to two distinct
monitors, which implies $\Gamma_{\mathcal{G}'}(v, m') \ge 2$ as each of these two
distinct monitors connects to $m'$ by a virtual link.

Condition (2) in Claim 23 is violated if and only if there
exist two non-monitors $v \ne w$ (at least one of them in
$S$) such that all monitor-to-monitor simple paths traversing
$v$ must traverse $w$ (i.e., $P_v \subseteq P_w$) and vice versa. Since
$P_v \subseteq P_w$ means that there is no monitor-to-monitor simple
path traversing $v$ in $\mathcal{G} - \{w\}$, by the above argument, we see
that $P_v \subseteq P_w$ if and only if the size of the $(m', v)$-vertex-cut in
a new graph $\mathcal{G}'_w := \mathcal{G} - \{w\} + \{m'\} + \mathcal{L}(\{m'\}, M)$ is smaller
than two. Therefore, condition (2) in Claim 23 is satisfied if
and only if for every two distinct non-monitors $v$ ($v \in S$) and
$w$, either the $(m', v)$-vertex-cut in $\mathcal{G}'_w$ or the $(m', w)$-vertex-cut
in $\mathcal{G}'_v$ contains two or more nodes.

In summary, the necessary and sufficient condition for
1-identifiability under CSP is:

i) $\Gamma_{\mathcal{G}'}(S, m') \ge 2$, and

ii) $\Gamma_{\mathcal{G}'_w}(v, m') \ge 2$ or $\Gamma_{\mathcal{G}'_v}(w, m') \ge 2$ for all $v \in S$, $w \in N$,
and $v \ne w$.

Since $\Gamma_{\mathcal{G}}(v, w) \ge 2$ can be tested in $O(|V| + |L|)$ time⁷, the
overall test takes $O(\sigma|S|(|V| + |L|)) = O(\sigma(\mu + \sigma)^2|S|)$ time.

⁷We can compute the biconnected component decomposition (U and test
if $v$ and $w$ belong to the same biconnected component.

4) Test under UP: Under UP, the total number of measurement
paths $|P|$ is reduced to $O(\mu^2)$ (from exponentially
many as in the case of CAP/CSP) as the measurable routes are
predetermined. This reduction makes it feasible to directly test
conditions (1-2) in Claim 23 by testing condition (1) for each
node in $S$ and condition (2) for each pair of non-monitors
(one of which is in $S$). The overall complexity of the test is
$O(\sigma\mu^2|S|)$, dominated by the testing of condition (2) in Claim 23.
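
A direct test of Claim 23 for an explicit path set can be sketched as below. Each path is represented by the set of non-monitors it traverses; the three paths used are the first three of the Section II example, taken here as an assumed set of routing-determined paths, and the function name is illustrative.

```python
def is_1_identifiable(paths, nodes, S):
    """Claim 23: every node of S is on some path, and no node of S shares
    exactly the same path set with another non-monitor."""
    P = {u: frozenset(i for i, p in enumerate(paths) if u in p) for u in nodes}
    detectable = all(P[v] for v in S)                       # condition (1)
    distinguishable = all(P[v] != P[w]                      # condition (2)
                          for v in S for w in nodes if w != v)
    return detectable and distinguishable

nodes = [1, 2, 3, 4]
up_paths = [frozenset(p) for p in ({1}, {4}, {2, 4})]  # assumed UP paths

partial = is_1_identifiable(up_paths, nodes, {1, 2, 4})
full = is_1_identifiable(up_paths, nodes, {1, 2, 3, 4})
```

Node 3 lies on none of the three paths, so any set containing it fails condition (1), while the remaining nodes have pairwise distinct path sets.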

V. Characterization of Maximum Identifiability 
Index 

By Proposition 6, the maximum identifiability index of a
given set $S$ is the minimum per-node maximum identifiability
index $\Omega(v)$ over all nodes $v \in S$. It thus suffices to characterize
the per-node maximum identifiability index for each probing
mechanism. Under CAP, we give the exact value of $\Omega(v)$
based on the necessary and sufficient condition in Theorem 15;
under CSP and UP, we establish tight upper and lower bounds
on $\Omega(v)$ based on the conditions in Theorems 18 and 21.

A. Maximum Identifiability Index under CAP 

Since Theorem 15 provides necessary and sufficient conditions,
it directly determines the value of $\Omega(v)$, as stated below.

Theorem 24 (Maximum Per-node Identifiability under CAP).
The maximum identifiability index of a non-monitor $v$ under
CAP is $\Omega^{\mathrm{CAP}}(v) = \Gamma_{\mathcal{G}^*}(v, m')$.

Evaluation algorithm: As shown in Section IV-A, $\Gamma_{\mathcal{G}^*}(v,
m')$ can be computed in $O(\theta\xi)$ time ($\theta$: the number of monitor
neighbors in $\mathcal{G}$, $\xi$: the number of links in $\mathcal{G}^*$, see Table I).
Therefore, $\Omega^{\mathrm{CAP}}(S)$ is computable in $O(\theta\xi|S|)$ time.

B. Maximum Identifiability Index under CSP 

Observe that both the sufficient and the necessary conditions
in Theorem 18 are imposed on the same property,
i.e., vertex-cuts of the auxiliary graphs $\mathcal{G}^*$ and $\mathcal{G}_m$. Let
$\delta^* := \Gamma_{\mathcal{G}^*}(v, m')$, $\delta_{\min} := \min_{m \in M} \Gamma_{\mathcal{G}_m}(v, m')$, and $\pi_v :=
\min(\delta_{\min}, \delta^* - 1)$. We obtain a tight characterization of the
maximum identifiability index under CSP as follows.

Theorem 25 (Maximum Per-node Identifiability under CSP).
If $\pi_v \le \sigma - 2$, the maximum identifiability index of a
non-monitor $v$ under CSP is bounded by $\pi_v - 1 \le \Omega^{\mathrm{CSP}}(v) \le \pi_v$.

Proof: The proof can be found in [26]. ■

Remark: Because the set of links in $\mathcal{G}_m$ is a subset of those
in $\mathcal{G}^*$ while the nodes are the same, we always have $\delta_{\min} \le \delta^*$.
Therefore, the above bounds simplify to:

• $\delta^* - 2 \le \Omega^{\mathrm{CSP}}(v) \le \delta^* - 1$ if $\delta_{\min} = \delta^*$;

• $\delta_{\min} - 1 \le \Omega^{\mathrm{CSP}}(v) \le \delta_{\min}$ if $\delta_{\min} < \delta^*$.

In particular, if $\delta^* = 1$, then $\exists$ a node $w \in N$
in $\mathcal{G}^*$ such that all simple paths starting at $v$ and terminating
at $m'$ must traverse $w$, i.e., $\nexists$ simple monitor-to-monitor paths
traversing $v$ ($P_v = \emptyset$); therefore $\Omega^{\mathrm{CSP}}(v) = 0$ (even single-node
failures in $S$ cannot always be localized if $v \in S$).

The only cases when $\pi_v \le \sigma - 2$ is violated are: (i) $\delta_{\min} =
\delta^* = \sigma$, or (ii) $\delta_{\min} = \sigma - 1$ and $\delta^* = \sigma$. In case (i),
non-monitor $v$ still has a monitor as a neighbor after removing $m$;
by Proposition 19, this implies that $\Omega^{\mathrm{CSP}}(v) = \sigma$. In case (ii),
Theorem 18(a) can still be applied to show that $\Omega^{\mathrm{CSP}}(v) \ge
\sigma - 2$, and one can verify that the condition in Proposition 19
is violated, which implies that $\Omega^{\mathrm{CSP}}(v) \le \sigma - 1$. In fact, we
can leverage Proposition 20 to uniquely determine $\Omega^{\mathrm{CSP}}(v)$ in
this case. If the conditions in Proposition 20 are satisfied, then
$\Omega^{\mathrm{CSP}}(v) = \sigma - 1$; otherwise, $\Omega^{\mathrm{CSP}}(v) = \sigma - 2$.
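
The case analysis above reduces to a few lines of arithmetic once $\delta^*$ and $\delta_{\min}$ are known. The sketch below assumes these two vertex-cut sizes have already been computed (for instance with a max-flow routine) and only encodes the bounds; the function name is hypothetical, and the lower bound is clamped at 0 because an index cannot be negative.

```python
def csp_index_bounds(delta_star, delta_min, sigma):
    """(lower, upper) bounds on the per-node index under CSP, following
    Theorem 25 and the two special cases discussed after it."""
    pi_v = min(delta_min, delta_star - 1)
    if pi_v <= sigma - 2:
        return max(pi_v - 1, 0), pi_v
    if delta_min == sigma:
        return sigma, sigma          # case (i): Proposition 19 applies
    return sigma - 2, sigma - 1      # case (ii): Proposition 20 decides which

bounds_equal = csp_index_bounds(delta_star=3, delta_min=3, sigma=6)
bounds_smaller = csp_index_bounds(delta_star=5, delta_min=2, sigma=6)
bounds_unreachable = csp_index_bounds(delta_star=1, delta_min=1, sigma=6)
bounds_case_i = csp_index_bounds(delta_star=4, delta_min=4, sigma=4)
bounds_case_ii = csp_index_bounds(delta_star=4, delta_min=3, sigma=4)
```

The first two calls correspond to the two bullets of the remark, the third to the $\delta^* = 1$ situation, and the last two to cases (i) and (ii).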

Evaluation algorithm: Evaluating $\Omega^{\mathrm{CSP}}(S)$ by Proposition 6
involves computing $\Omega(v)$ for all $v \in S$, each requiring the
computation of the vertex-cuts of the auxiliary graphs $\mathcal{G}^*$ and
$\mathcal{G}_m$ ($\forall m \in M$) as that in Section IV-A, which altogether takes
$O(\mu\theta\xi|S|)$ time.


C. Maximum Identifiability Index under UP 

As in the case of CSP, we can leverage the sufficient and the
necessary conditions in Theorem 21 to bound the maximum
identifiability index under UP from both sides. The conditions
in Theorem 21 imply the following bounds on the maximum
identifiability index under UP.


Theorem 26 (Maximum Per-node Identifiability under UP).
The maximum identifiability index of a non-monitor $v$ under
UP with measurement paths $P$ is bounded by $\mathrm{MSC}(v) - 1 \le
\Omega^{\mathrm{UP}}(v) \le \mathrm{MSC}(v)$.


Proof: The proof can be found in [26]. ■

Evaluation algorithm: The original bounds in Theorem 26
are hard to evaluate due to the NP-hardness of computing
$\mathrm{MSC}(\cdot)$. As in Section IV-C, we resort to the greedy algorithm,
which implies the following relaxed bounds:

\[
\left\lceil \frac{\mathrm{GSC}(v)}{\log(|P_v|) + 1} \right\rceil - 1 \le \Omega^{\mathrm{UP}}(v) \le \mathrm{GSC}(v).
\tag{5}
\]

Evaluating these bounds for $\Omega^{\mathrm{UP}}(S)$ involves invoking the
greedy algorithm for each node in $S$, with an overall complexity
of $O(|S||P|^2\sigma)$ (or $O(\mu^4\sigma|S|)$ if all monitors can probe
each other).


VI. Characterization of the Maximum 
Identifiable Set 

By Proposition 8, the maximum $k$-identifiable set $S^*(k)$ is
related to the per-node maximum identifiability index $\Omega(v)$ by
$S^*(k) = \{v \in N : \Omega(v) \ge k\}$. Therefore, $S^*(k)$ can be easily
computed based on values of $\Omega(v)$ ($v \in N$) for any value of
$k$. Moreover, given upper/lower bounds on $\Omega(v)$, i.e., $\Omega_l(v) \le
\Omega(v) \le \Omega_u(v)$, $S^*(k)$ can be bounded by $S^{\mathrm{inner}}(k) \subseteq S^*(k) \subseteq
S^{\mathrm{outer}}(k)$ for $S^{\mathrm{inner}}(k) := \{v \in N : \Omega_l(v) \ge k\}$ and $S^{\mathrm{outer}}(k) :=
\{v \in N : \Omega_u(v) \ge k\}$. Based on this observation, we now
characterize $S^*(k)$ for each of the three probing mechanisms.
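
The inner and outer sets follow mechanically from per-node bounds. The sketch below uses made-up bound values for four nodes (in the spirit of the earlier example, where only $v_2$ has a gap between its lower and upper bounds); the function and variable names are illustrative.

```python
def bound_sets(lower, upper, k):
    """Inner and outer bounds of S*(k) from per-node lower/upper index bounds."""
    inner = {v for v in lower if lower[v] >= k}
    outer = {v for v in upper if upper[v] >= k}
    return inner, outer

# Hypothetical per-node bounds: only v2 is uncertain (index between 1 and 3).
lower = {"v1": 4, "v2": 1, "v3": 4, "v4": 4}
upper = {"v1": 4, "v2": 3, "v3": 4, "v4": 4}

inner_2, outer_2 = bound_sets(lower, upper, 2)
inner_4, outer_4 = bound_sets(lower, upper, 4)
```

Nodes in the inner set are guaranteed members of $S^*(k)$, nodes outside the outer set are guaranteed non-members, and only the nodes in between need further analysis.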


A. Maximum k-identifiable Set under CAP 

The expression of the maximum per-node identifiability under
CAP in Theorem 24 leads to the following characterization
of the maximum $k$-identifiable set.

Corollary 27. The maximum $k$-identifiable set under CAP,
denoted by $S^*_{\mathrm{CAP}}(k)$, is $S^*_{\mathrm{CAP}}(k) = \{v \in N : \Gamma_{\mathcal{G}^*}(v, m') \ge k\}$.

Specifically, when $k = \sigma$, $S^*_{\mathrm{CAP}}(\sigma)$ contains all the
non-monitors directly adjacent to monitors.


Evaluation algorithm: As shown in Section IV-A, $\Gamma_{\mathcal{G}^*}(v,
m')$ can be computed in $O(\theta\xi)$ time. Thus, the total time
complexity for constructing $S^*_{\mathrm{CAP}}(k)$ is $O(\theta\xi\sigma)$.


B. Maximum k-identifiable Set under CSP 

Leveraging Theorem 25, we can establish outer and inner
bounds (i.e., superset and subset) for the maximum
$k$-identifiable set under CSP.

Corollary 28. Let $S^{\mathrm{outer}}_{\mathrm{CSP}}(k) := \{v \in N : \pi_v \ge k\}$, and
$S^{\mathrm{inner}}_{\mathrm{CSP}}(k) := \{v \in N : \pi_v \ge k + 1\}$. The maximum $k$-identifiable
set under CSP ($k < \sigma - 1$), denoted by $S^*_{\mathrm{CSP}}(k)$, is bounded
by $S^{\mathrm{inner}}_{\mathrm{CSP}}(k) \subseteq S^*_{\mathrm{CSP}}(k) \subseteq S^{\mathrm{outer}}_{\mathrm{CSP}}(k)$.

Proof: The proof can be found in [26]. ■

One case not covered by Corollary 28 is $k = \sigma$. In this
case, $S^*_{\mathrm{CSP}}(\sigma)$ contains all non-monitors that have at least two
monitors as neighbors according to Proposition 19. Another
non-covered case is $k = \sigma - 1$, for which we have the
following result.

Corollary 29. When $k = \sigma - 1$, $S^*_{\mathrm{CSP}}(\sigma - 1) = \{v \in N : v$ has
at least two monitor neighbors$\} \cup \tilde{S}$. Set $\tilde{S}$ contains one and
only one non-monitor $w$ if all nodes in $N$ but $w$ have at least
two monitor neighbors and $w$ has one monitor and all nodes
in $N \setminus \{w\}$ as neighbors; otherwise, $\tilde{S} = \emptyset$.

Proof: The proof can be found in [26]. ■

Corollary 29 implies that when $\tilde{S}$ is not empty (i.e., $|\tilde{S}| =
1$), then $S^*_{\mathrm{CSP}}(\sigma - 1) = N$ and $S^*_{\mathrm{CSP}}(\sigma) = N \setminus \tilde{S}$ (i.e.,
$|S^*_{\mathrm{CSP}}(\sigma - 1)| = \sigma$ and $|S^*_{\mathrm{CSP}}(\sigma)| = \sigma - 1$).

Evaluation algorithm: Corollary 29 is computable in linear
time. Similar to Section IV-B, $\pi_v$ in Corollary 28 is in $O(\mu\theta\xi)$
complexity. Therefore, the overall complexity is $O(\mu\theta\xi\sigma)$.


C. Maximum k-identifiable Set under UP 

Analogous to the case of CSP, we leverage Theorem 26 to
develop the following outer and inner bounds for the maximum
k-identifiable set under UP.

Corollary 30. Let $S^{\mathrm{outer}}_{\mathrm{UP}}(k) := \{v \in N : \mathrm{MSC}(v) > k\}$ and
$S^{\mathrm{inner}}_{\mathrm{UP}}(k) := \{v \in N : \mathrm{MSC}(v) > k + 1\}$ with measurement
paths $P$. The maximum k-identifiable set under UP ($k < \sigma - 1$),
denoted by $S^*_{\mathrm{UP}}(k)$, is bounded by
$S^{\mathrm{inner}}_{\mathrm{UP}}(k) \subseteq S^*_{\mathrm{UP}}(k) \subseteq S^{\mathrm{outer}}_{\mathrm{UP}}(k)$.

Proof: The proof can be found in [26]. ■

A special case left out by Corollary 30 is $k = \sigma$. In this
case, we use Proposition 22 to determine $S^*_{\mathrm{UP}}(\sigma)$, i.e.,
$S^*_{\mathrm{UP}}(\sigma) = \{w \in N : w$ is on a 2-hop path$\}$.

Evaluation algorithm: Due to the NP-hardness of computing
MSC(·), we again resort to the greedy algorithm, whereby
the outer and inner bounds of $S^*_{\mathrm{UP}}(k)$ can be relaxed by
computing GSC(·). Let the relaxed outer bound be
$\{v \in N : \mathrm{GSC}(v) > k\}$ and the relaxed inner bound be
$\{v \in N : \mathrm{GSC}(v)/(\log(|P_v|) + 1) > k + 1\}$.
According to Proposition 8, the relaxed outer bound is a superset of
$S^{\mathrm{outer}}_{\mathrm{UP}}(k)$ and the relaxed inner bound is a subset of
$S^{\mathrm{inner}}_{\mathrm{UP}}(k)$. The computation of these relaxed bounds
involves $O(\sigma|P|^2)$ time complexity w.r.t. each node in $N$.
Thus, the overall complexity is $O(\sigma^2|P|^2)$.
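
The greedy relaxation can be sketched as follows. This is an illustration under several assumptions that are not spelled out in this section: MSC(v) is taken to be the minimum number of other non-monitors that together lie on every measurement path traversing v, the logarithm is natural, the normalizing term uses the number of paths through v, and a node with a path containing no other non-monitor is treated as directly measurable and placed in both bounds. All function names are hypothetical.

```python
import math


def greedy_set_cover(v, paths):
    """Greedy estimate GSC(v) of the cover number of the paths through v.

    paths is a list of sets, each holding the non-monitors on one path.
    Returns None when some path through v contains no other non-monitor.
    """
    to_cover = {i for i, p in enumerate(paths) if v in p}
    candidates = {}  # other non-monitor -> indices of v's paths it lies on
    for i in to_cover:
        for w in paths[i]:
            if w != v:
                candidates.setdefault(w, set()).add(i)
    chosen = 0
    while to_cover:
        # Pick the node covering the most uncovered paths (ties: label order).
        best = max(sorted(candidates),
                   key=lambda w: len(candidates[w] & to_cover), default=None)
        if best is None or not candidates[best] & to_cover:
            return None
        to_cover -= candidates[best]
        chosen += 1
    return chosen


def relaxed_up_bounds(nodes, paths, k):
    """Relaxed inner/outer bounds on the maximum k-identifiable set under UP."""
    inner, outer = set(), set()
    for v in nodes:
        gsc = greedy_set_cover(v, paths)
        if gsc is None:  # assumption: directly measurable node
            inner.add(v)
            outer.add(v)
            continue
        n_v = sum(1 for p in paths if v in p)
        if gsc > k:
            outer.add(v)
        if n_v > 0 and gsc / (math.log(n_v) + 1) > k + 1:
            inner.add(v)
    return inner, outer


# Hypothetical measurement paths, listed by the non-monitors they traverse.
paths = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"d"}]
inner, outer = relaxed_up_bounds(["a", "b", "c", "d"], paths, k=1)
print(sorted(inner), sorted(outer))
```

The gap between the two returned sets reflects both the one-unit gap between the thresholds and the logarithmic approximation factor of the greedy cover.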

Fig. 4. Maximum k-identifiable set $S^*(k)$ under CAP, CSP, and UP for ER
graphs ($|V| = 20$, $\mu = \{2, 10\}$, $E[|L|] = 51$, 200 graph instances,
$\sigma$: total number of non-monitors). Panels: (a) $\mu = 2$; (b) $\mu = 10$.

Fig. 5. Maximum k-identifiable set $S^*(k)$ under CAP, CSP, and UP for
Rocketfuel AS1755 ($|V| = 172$, $|L| = 381$, $\mu = \{50, 163\}$, 100 Monte
Carlo runs, $\sigma$: total number of non-monitors). Panels: (a) $\mu = 50$;
(b) $\mu = 163$. Horizontal axis: $k$.

Fig. 6. Maximum k-identifiable set $S^*(k)$ under CAP, CSP, and UP for
CAIDA AS26788 ($|V| = 355$, $|L| = 483$, $\mu = \{200, 346\}$, 100 Monte
Carlo runs, $\sigma$: total number of non-monitors). Panels: (a) $\mu = 200$;
(b) $\mu = 346$.

VII. Evaluation of Failure Localization Capability

We demonstrate how the proposed measures of maximum 
identifiability index and maximum identifiable set can be 
used to evaluate the impact of various parameters, including 
topology, number of monitors, and probing mechanisms (CAP, 
CSP, UP), on the capability of failure localization. In this 
study, we assume (hop count-based) shortest path routing as 
the default routing protocol under UP, i.e., measurement paths 
under UP are the shortest paths between monitors, with ties 
broken arbitrarily. 
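
To make the UP assumption concrete, the sketch below derives one hop-count shortest path per monitor pair with breadth-first search. The adjacency-dictionary representation, the function names, and the tie-breaking rule (visit neighbors in sorted label order) are illustrative assumptions; any fixed tie-breaking rule would fit the description above.

```python
from collections import deque
from itertools import combinations


def shortest_path(adj, src, dst):
    """Return one hop-count shortest path from src to dst, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for w in sorted(adj[u]):  # fixed order makes tie-breaking repeatable
            if w not in parent:
                parent[w] = u
                queue.append(w)
    if dst not in parent:
        return None
    path, node = [], dst
    while node is not None:  # walk back from dst to src
        path.append(node)
        node = parent[node]
    return path[::-1]


def up_measurement_paths(adj, monitors):
    """One shortest path per unordered pair of monitors."""
    paths = {}
    for m1, m2 in combinations(sorted(monitors), 2):
        p = shortest_path(adj, m1, m2)
        if p is not None:
            paths[(m1, m2)] = p
    return paths


# Hypothetical topology: a and b both neighbor m1 and m2; c hangs off m2.
adj = {
    "m1": {"a", "b"}, "m2": {"a", "b", "c"},
    "a": {"m1", "m2"}, "b": {"m1", "m2"}, "c": {"m2"},
}
paths = up_measurement_paths(adj, {"m1", "m2"})
print(paths)
```

Only one of the two equally short routes between the monitors is probed here, which is the single-path behavior assumed for UP.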

A. Topologies for Evaluation 

We first employ random graph models to generate a comprehensive
set of topologies without artifacts of specific network
deployments. We consider random Erdős-Rényi (ER) graphs
E], generated by independently connecting each pair of 
nodes by a link with a fixed probability p. The result is a purely 
random topology where all graphs with an equal number of 
links are equally likely to be selected (note that the number of 
nodes is an input parameter). In addition to ER graphs, other 
random graph models are also considered; the corresponding 
results are presented in [26] due to space limitations.
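
The ER construction described above can be sketched in a few lines. The function names are hypothetical. The probability used in the example, p = 51/190, is our own derivation from the figure parameters ($|V| = 20$, $E[|L|] = 51$) via $E[|L|] = p\,|V|(|V|-1)/2$; it is not a value stated in the text.

```python
import random
from itertools import combinations


def er_graph(n, p, seed=None):
    """Erdős-Rényi graph: link each node pair independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def num_links(adj):
    """Number of undirected links (each link is stored at both endpoints)."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


n = 20
p = 51 / (n * (n - 1) / 2)  # derived so that the expected number of links is 51
graph = er_graph(n, p, seed=0)
print(num_links(graph))
```

Individual instances will have more or fewer than 51 links; only the average over many instances approaches the expected value.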

We then evaluate real Autonomous System (AS) topolo¬ 
gies collected by the Rocketfuel ll^ and the CAIDA ll^ 
projects, which represent IP-level connections between
backbone/gateway routers of several ASes from major Internet
Service Providers (ISPs) around the globe. 

B. Evaluation Results 

We focus on evaluating the per-node maximum identifiability
index $\Omega(v)$ since it determines both the per-set maximum
identifiability index $\Omega(S)$ and the maximum identifiable set
$S^*(k)$. In particular, the complementary cumulative distribution
function (CCDF) of $\Omega(v)$ over all $v \in N$ (refer to Table I
for notations) coincides with the normalized cardinality of the
maximum identifiable set $|S^*(k)|/\sigma$, and thus we characterize
the distribution of $\Omega(v)$ by evaluating $|S^*(k)|/\sigma$ with
respect to $k$. Moreover, we examine the specific value of $\Omega(v)$
and compare it with the degree (i.e., number of neighbors) of $v$ among
monitor/non-monitor nodes to evaluate the correlation between the
maximum identifiability index and the graph property (i.e.,
degree) of a node. Under UP, our extensive simulations under
multiple graph models have shown that MSC(v) can be closely
approximated by GSC(v); hence, we use GSC(v) in place of
MSC(v) in these computations; see [26] for details.
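
The link between the distribution of the per-node index and the normalized set size can be illustrated with a short sketch. It assumes that the maximum k-identifiable set collects the non-monitors whose index is at least k; the function name and the index values are hypothetical.

```python
def normalized_max_identifiable_set(omega, k):
    """Fraction of non-monitors whose maximum identifiability index is >= k.

    omega maps each non-monitor to its per-node index; sigma = len(omega).
    Assumption: S*(k) = {v : omega[v] >= k}.
    """
    sigma = len(omega)
    return sum(1 for val in omega.values() if val >= k) / sigma


# Hypothetical indices for sigma = 5 non-monitors; a and b are directly
# measurable, so their index equals sigma.
omega = {"a": 5, "b": 5, "c": 2, "d": 1, "e": 0}
curve = [normalized_max_identifiable_set(omega, k) for k in range(1, 6)]
print(curve)  # [0.8, 0.6, 0.4, 0.4, 0.4]
```

The curve is non-increasing in k, and it flattens at the fraction of nodes whose index equals the total number of non-monitors.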

1) Distribution of $\Omega(v)$: To characterize the overall
distribution of $\Omega(v)$, we compute (bounds on)†
$S^*_{\mathrm{CAP}}(k)$, $S^*_{\mathrm{CSP}}(k)$, and $S^*_{\mathrm{UP}}(k)$ to evaluate
$|S^*(k)|/\sigma$ for different values of $k$ ($\sigma$: total number of
non-monitors). Fig. 4 reports averages of $|S^*(k)|/\sigma$ computed on
ER graphs over multiple randomly selected instances of topology and
monitor locations, where $|S^*(k)|/\sigma$ under CSP and UP is
represented by a band with its width determined by
$(|S^{\mathrm{outer}}(k)| - |S^{\mathrm{inner}}(k)|)/\sigma$. The results
show large differences in the failure localization capabilities of
different probing mechanisms: When the number of monitors
is small ($\mu = 2$) and $k = 2$, $S^*_{\mathrm{UP}}(k)$ is almost empty, i.e., no
(non-monitor) node state can be uniquely determined by UP
when there are multiple failures; in contrast,
$|S^*_{\mathrm{CSP}}(k)|/\sigma \approx 0.5$ and $|S^*_{\mathrm{CAP}}(k)|/\sigma \approx 1$,
i.e., CSP can uniquely determine the states of half of the nodes and
CAP can determine the states of all the nodes when $\mu = 2$ and $k = 2$.
When the number of monitors increases ($\mu = 10$), there exist more
measurement paths between monitors, and thus the fraction of
identifiable nodes increases for all three probing mechanisms. In
addition, we observe a stable phase in Fig. 4 where the value of
$|S^*(k)|/\sigma$ remains the same as we increase $k$; this is because
some non-monitors have monitors as neighbors and are thus directly
measurable by these neighboring monitors without traversing
other non-monitors. Specifically, if there are non-monitors that
neighbor at least one monitor under CAP, neighbor at least two
monitors under CSP, or lie on 2-hop paths between monitors
under UP, then the failure of these non-monitors can always
be identified regardless of the total number of failures in the
network, i.e., the maximum identifiability index of these
non-monitors is the total number of non-monitors. Note that in
Fig. 4 the number of such directly measurable non-monitors
is smaller under UP than under CSP. This is because for
non-monitors that neighbor the same pair of monitors (e.g., $m_1$ and
$m_2$), all these non-monitors are directly measurable on 2-hop
$m_1$-to-$m_2$ paths under CSP; however, only one of these
non-monitors is on a 2-hop $m_1$-to-$m_2$ path under UP as UP probes
only one routing path between each pair of monitors (assuming
stable single-path routing). Similar results have been obtained
for other random graph models (see [26] for details).

†Proposition 19, Corollary 29, and Proposition 22 are used to determine
the exact elements in $S^*_{\mathrm{CSP}}(\sigma)$, $S^*_{\mathrm{CSP}}(\sigma - 1)$, and $S^*_{\mathrm{UP}}(\sigma)$.
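
The three notions of a directly measurable non-monitor can be compared on a small example. The sketch below assumes an adjacency dictionary of neighbor sets and a list of UP measurement paths given as node sequences; the function name and the topology are hypothetical.

```python
def directly_measurable(adj, monitors, up_paths):
    """Directly measurable non-monitors under CAP, CSP, and UP.

    CAP: at least one monitor neighbor; CSP: at least two monitor neighbors;
    UP: middle node of a 2-hop (three-node) measurement path.
    """
    monitors = set(monitors)
    non_monitors = set(adj) - monitors
    cap = {v for v in non_monitors if len(adj[v] & monitors) >= 1}
    csp = {v for v in non_monitors if len(adj[v] & monitors) >= 2}
    up = {p[1] for p in up_paths if len(p) == 3 and p[1] in non_monitors}
    return cap, csp, up


# Hypothetical topology: a and b neighbor both monitors, c neighbors only m2.
adj = {
    "m1": {"a", "b"}, "m2": {"a", "b", "c"},
    "a": {"m1", "m2"}, "b": {"m1", "m2"}, "c": {"m2"},
}
# Under UP only one path is probed between m1 and m2 (here via a).
up_paths = [["m1", "a", "m2"]]
cap, csp, up = directly_measurable(adj, {"m1", "m2"}, up_paths)
print(sorted(cap), sorted(csp), sorted(up))
```

Both a and b qualify under CSP, but only the node on the single probed route qualifies under UP, which is the effect described for non-monitors sharing the same pair of monitors.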

We repeat the above evaluation on AS topologies. We select 
AS 1755 from Rocketfuel topologies ll^ and AS26788 from 
CAIDA topologies ll^ . and evaluate the bounds on |S'*(fc)|/cr 
under multiple instances of random monitor placements; average
results are reported in Figs. 5 and 6. Similar to the
case of random topologies, there are clear differences between
different probing mechanisms. Unlike the uniformly connected
random topologies in Fig. 4, these AS topologies contain many
sparse subgraphs where the removal of a few nodes can disconnect
the network. Thus, unless a node is directly measurable
by monitors, it is likely that failures of a few other nodes
will disconnect it from monitors and thus make its failure
undetectable. Comparing results from Rocketfuel and CAIDA,
we observe that the CAIDA AS requires more monitors to
achieve the same level of identifiability. Moreover, deploying
more monitors in the CAIDA AS only slightly improves the level
of identifiability. This can be explained by examining the link
density $|L|/|V|$ of the network: $|L|/|V| = 1.36$ for the CAIDA
AS, whereas $|L|/|V| = 2.22$ for the Rocketfuel AS, i.e., the
CAIDA AS topology is nearly a tree. Therefore, it is likely for
a node to not reside on any paths between monitors or become
unmeasurable after the failure of one other node in the CAIDA
AS, even if the paths are controllable but cycle-free (CSP).
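
The link-density figures can be reproduced from the node and link counts given in the figure captions. The short sketch below also reports how many links each topology has beyond a spanning tree, an extra quantity we compute for illustration only (assuming each topology is connected); the function names are hypothetical.

```python
def link_density(num_links, num_nodes):
    """Links per node."""
    return num_links / num_nodes


def extra_links_over_tree(num_links, num_nodes):
    """Links beyond the num_nodes - 1 needed by a spanning tree."""
    return num_links - (num_nodes - 1)


# Node and link counts as listed in the figure captions.
topologies = {"Rocketfuel AS1755": (172, 381), "CAIDA AS26788": (355, 483)}
summary = {
    name: (round(link_density(links, nodes), 2),
           extra_links_over_tree(links, nodes))
    for name, (nodes, links) in topologies.items()
}
print(summary)
```

A tree on $|V|$ nodes has density $(|V| - 1)/|V|$, which is just below 1, so the CAIDA value is much closer to the tree regime than the Rocketfuel value.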

2) Correlation of $\Omega(v)$ and Degree: Next, we examine
specific values of $\Omega(v)$ for each non-monitor $v \in N$ for
selected instances of network topology and monitor placement.
Our goal is to compare these values with node degrees to
understand the correlation between the proposed identifiability
measure and typical graph-theoretic node properties. Specifically,
we sort non-monitors in a non-increasing order of $\Omega(v)$
under each of the three probing mechanisms, and compare
$\Omega(v)$ with the degrees of $v$ among monitors/non-monitors‡;
see results in Fig. 7 for random topologies and in Figs. 8-9 for
AS topologies. The results show strong correlations between
$\Omega(v)$ and the degree of $v$, denoted by $d(v)$. Specifically, denote
the number of neighbors of $v$ that are monitors by $d^m(v)$
and the number of neighbors of $v$ that are non-monitors by
$d^n(v)$; the overall degree $d(v) = d^m(v) + d^n(v)$. If node $v$ has
sufficient monitor neighbors ($d^m(v) \ge 1$ for CAP, $d^m(v) \ge 2$
for CSP), then $v$ is directly measurable and thus $\Omega(v) = \sigma$
regardless of the actual degree of $v$; if node $v$ does not have a
sufficient number of monitors as neighbors, then $\Omega(v) < d(v)$
because if all neighbors of $v$ fail, then the state of $v$ cannot be
determined by path measurements. However, in the latter case,
$d(v)$ is only a loose upper bound, and the exact value of $\Omega(v)$
depends on the overall topology, the locations of monitors,
and the constraints on measurement paths. In this regard, our
result can also be viewed as defining a new node property
($\Omega(v)$) that takes into account all these parameters.

‡Note that node IDs are different under different probing mechanisms due
to the different order of $\Omega(v)$ values.
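
The degree-based reasoning above yields a simple upper bound on the per-node index under CAP and CSP. The sketch below encodes it; the function name and the topology are hypothetical, the bound for nodes without enough monitor neighbors is taken as $d(v) - 1$ (the largest integer below $d(v)$), and UP is omitted because its direct-measurability condition depends on routing rather than on degree alone.

```python
def degree_upper_bound(adj, monitors, v, mechanism):
    """Upper bound on the maximum identifiability index of non-monitor v.

    mechanism is "CAP" (needs >= 1 monitor neighbor) or "CSP" (needs >= 2).
    """
    monitors = set(monitors)
    sigma = len(set(adj) - monitors)   # total number of non-monitors
    d_m = len(adj[v] & monitors)       # monitor neighbors
    d_n = len(adj[v] - monitors)       # non-monitor neighbors
    needed = {"CAP": 1, "CSP": 2}[mechanism]
    if d_m >= needed:
        return sigma                   # directly measurable
    return min(sigma, d_m + d_n - 1)   # index is strictly below the degree


# Hypothetical topology with monitors m1, m2 and non-monitors a, b, c, d.
adj = {
    "m1": {"a", "b"}, "m2": {"a", "b", "c"},
    "a": {"m1", "m2", "d"}, "b": {"m1", "m2", "d"},
    "c": {"m2", "d"}, "d": {"a", "b", "c"},
}
bounds = {
    (v, mech): degree_upper_bound(adj, {"m1", "m2"}, v, mech)
    for v in "abcd" for mech in ("CAP", "CSP")
}
print(bounds)
```

As noted in the text, this bound is loose in general: the actual index also depends on the rest of the topology and on the path constraints.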

Overall, we observe that CAP-type probing is hugely advantageous
in uniquely monitoring node states under failures,
especially when there are multiple failures and the network is
sparse. This implies that in the absence of deploying monitors
at every node, implementing controllable probing is an effective
way to uniquely localize node failures. Our observation
also stresses the importance of optimized monitor placement,
especially when we are only interested in monitoring a subset
of nodes, which is left to future work.

VIII. Conclusion 

We studied the fundamental capability of a network in localizing
failed nodes from binary measurements (normal/failed)
of paths between monitors. We proposed two novel measures:
maximum identifiability index that quantifies the scale
of uniquely localizable failures with respect to a given node set, and
maximum identifiable set that quantifies the scope of unique
localization under a given scale of failures. We showed that
both measures are functions of the maximum identifiability
index per node. We studied these measures for three types
of probing mechanisms that offer different controllability of
probes and complexity of implementation. For each probing
mechanism, we established necessary/sufficient conditions
for unique failure localization based on network topology,
placement of monitors, constraints on measurement paths,
and scale of failures. We further showed that these conditions
lead to tight upper/lower bounds on the maximum
identifiability index, as well as inner/outer bounds on the
maximum identifiable set. We showed that both the conditions
and the bounds can be evaluated efficiently using polynomial-time
algorithms. Our evaluations on random and real network
topologies showed that probing mechanisms that allow monitors
to control the routing of probes have significantly better
capability to uniquely localize failures.

Fig. 7. Node maximum identifiability index $\Omega(v)$ of one ER graph under different probing mechanisms ($|V| = 20$, $\mu = 4$, $E[|L|] = 51$). Panels: (a) under CAP; (b) under CSP; (c) under UP.

Fig. 8. Node maximum identifiability index $\Omega(v)$ of Rocketfuel AS1755 under different probing mechanisms ($|V| = 172$, $|L| = 381$, $\mu = 70$). Panels: (a) under CAP; (b) under CSP; (c) under UP. Horizontal axis: non-monitor ID $v$.

Fig. 9. Node maximum identifiability index $\Omega(v)$ of CAIDA under different probing mechanisms ($|V| = 355$, $|L| = 483$, $\mu = 296$). Panels: (a) under CAP; (b) under CSP; (c) under UP.